Twisted
 
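Used to fetch web pages. It has an excellent asynchronous, event-driven architecture, and common protocols, including HTTP and SMTP, are already implemented.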
 
 
getPage("http://www.google.com").addCallback(printPage)
 
A single line is enough to fetch a web page.
 
 
 
lxml
 
Fast, and it supports XPath.
 
 
def getNextPageLink(self, tree):
    """Get the link to the next page.

    @param tree: parsed document tree to search for the link
    @return: URL of the next page, or None if there is no next page
    """
    # Locate the pagination container
    paging = tree.xpath("//span[@class='paging']")
    if paging:
        # Pick the anchor whose text contains the localized "next" label
        links = paging[0].xpath("./a[contains(text(), '%s')]" % self.localText['next'])
        if links:
            return str(links[0].get('href'))
    return None
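To try the next-page lookup outside of a class, here is a minimal standalone sketch. The sample HTML, the function name get_next_page_link, and the next_text parameter are assumptions made for illustration. It passes the label as an XPath variable through lxml's keyword arguments rather than string formatting, which avoids quoting problems when the label contains an apostrophe.

```python
from lxml import html

# Assumed sample markup; a real page will differ.
SAMPLE = """
<html><body>
  <span class="paging">
    <a href="/list?page=1">Previous</a>
    <a href="/list?page=3">Next</a>
  </span>
</body></html>
"""

def get_next_page_link(tree, next_text):
    """Return the href of the link whose text contains next_text, or None."""
    paging = tree.xpath("//span[@class='paging']")
    if paging:
        # $label is bound to the keyword argument of the same name.
        links = paging[0].xpath("./a[contains(text(), $label)]", label=next_text)
        if links:
            return str(links[0].get("href"))
    return None

tree = html.fromstring(SAMPLE)
next_url = get_next_page_link(tree, "Next")
print(next_url)
```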
 
listPrice = tree.xpath("//*[@class='priceBlockLabel']/following-sibling::*")
if listPrice:
    detail['listPrice'] = self.stripMoney(listPrice[0].text)
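The following-sibling axis selects the elements that come after the label within the same parent, which is how the price next to the label is reached. A minimal sketch under assumptions: the sample markup is invented, and strip_money is a hypothetical stand-in for stripMoney, whose implementation is not shown in the original.

```python
import re
from decimal import Decimal

from lxml import html

# Assumed sample markup: a label cell followed by the price cell.
SAMPLE = """
<html><body><table><tr>
  <td class="priceBlockLabel">List Price:</td>
  <td>$1,299.99</td>
</tr></table></body></html>
"""

def strip_money(text):
    # Hypothetical helper: drop everything except digits and the decimal point.
    return Decimal(re.sub(r"[^\d.]", "", text))

tree = html.fromstring(SAMPLE)
detail = {}

# Elements after the label that share its parent.
listPrice = tree.xpath("//*[@class='priceBlockLabel']/following-sibling::*")
if listPrice:
    detail["listPrice"] = strip_money(listPrice[0].text)

print(detail)
```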
 
 
Tools used
 
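Firefox add-ons: XPath tools such as XPath Checker can be used first to confirm that an expression selects the right elements, and Firebug can then be used to inspect the page structure.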










