Python - Scrapy: how to separate text within an HTML tag element
The HTML containing the data:
    <div id="content"><!-- instancebegineditable name="editregion3" -->
    <div id="content_div">
      <div class="title" id="content_title_div"><img src="img/banner_outlets.jpg" width="920" height="157" alt="outlets" /></div>
      <div id="menu_list">
        <table border="0" cellpadding="5" cellspacing="5" width="100%">
          <tbody>
            <tr>
              <td valign="top">
                <p>
                  <span class="foodtitle">century square</span><br />
                  2 tampines central 5<br />
                  #01-44-47 century square<br />
                  singapore 529509</p>
                <p>
                  <br />
                  <strong>opening hours:</strong><br />
                  7am 12am (sun-thu & ph)<br />
                  24 hours (fri & sat &</p>
                <p>
                  eve of ph)<br />
                  telephone: 6789 0457</p>
              </td>
              <td valign="top">
                <img alt="century square" src="/assets/images/outlets/century_sq.jpg" style="width: 260px; height: 140px" /></td>
              <td valign="top">
                <span class="foodtitle">liat towers</span><br />
                541 liat towers #01-01<br />
                orchard road<br />
                singapore 238888<br />
                <br />
                <strong>opening hours: </strong><br />
                24 hours (daily)<br />
                <br />
                telephone: 6737 8036</td>
              <td valign="top">
                <img alt="liat towers" src="/assets/images/outlets/century_liat.jpg" style="width: 260px; height: 140px" /></td>
            </tr>
I want to get:
place name: century square, liat towers
address: 2 tampines central 5, 541 liat towers #01-01
postal code: singapore 529509, singapore 238888
opening hours: 7-12am, 24 hours daily
For example, the first <p> in the first <td valign="top"> holds three pieces of data that I want (name, address, postal code). How do I split them?
Here is my spider code:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import re
    from todo.items import wendyitem


    class wendyspider(BaseSpider):
        name = "wendyspider"
        allowed_domains = ["wendys.com.sg"]
        start_urls = ["http://www.wendys.com.sg/outlets.php"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            values = hxs.select('//td')
            items = []
            for value in values:
                item = wendyitem()
                item['name'] = value.select('//span[@class="foodtitle"]/text()').extract()
                # the XPath expressions below are the part I do not know how to write
                item['address'] = value.select().extract()
                item['postal'] = value.select().extract()
                item['hours'] = value.select().extract()
                item['contact'] = value.select().extract()
                items.append(item)
            return items
I would select every <td valign="top"> that contains a <span class="foodtitle">:
//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodtitle"]]
and then, for each of these td cells, get all the text nodes:
.//text()
That gives you something like this:
['\n ', '\n ', 'century square', '\n 2 tampines central 5', '\n #01-44-47 century square', '\n singapore 529509', '\n ', '\n ', 'opening hours:', u'\n 7am 12am (sun-thu &\xa0ph)', u'\n 24 hours (fri & sat\xa0&', '\n ', '\n eve of ph)', '\n telephone: 6789 0457', '\n ']
and
['\n ', 'liat towers', '\n 541 liat towers #01-01', '\n orchard road', '\n singapore 238888', 'opening hours: ', '\n 24 hours (daily)', '\n telephone: 6737 8036']
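If you want to try the selection step without a Scrapy project, here is a rough sketch using only the standard library. It is not the Scrapy API: `xml.etree.ElementTree` supports only a subset of XPath, so the "contains a foodtitle span" check is done in Python rather than in a predicate. `SNIPPET` is a trimmed, well-formed stand-in for the page, and `outlet_cells` and `text_lines` are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the outlets table (an assumption,
# not the real page, which is not valid XML).
SNIPPET = """
<div id="menu_list">
  <table>
    <tr>
      <td valign="top">
        <p><span class="foodtitle">century square</span><br />
        2 tampines central 5<br />
        singapore 529509</p>
      </td>
      <td valign="top"><img alt="century square" src="century_sq.jpg" /></td>
      <td valign="top">
        <span class="foodtitle">liat towers</span><br />
        541 liat towers #01-01<br />
        singapore 238888
      </td>
    </tr>
  </table>
</div>
""".strip()


def outlet_cells(root):
    """Yield the <td valign="top"> cells that contain a foodtitle span."""
    for td in root.iter("td"):
        if td.get("valign") != "top":
            continue
        if td.find(".//span[@class='foodtitle']") is not None:
            yield td


def text_lines(cell):
    """Return the cell's text nodes, stripped, with blank ones dropped."""
    return [t.strip() for t in cell.itertext() if t.strip()]


root = ET.fromstring(SNIPPET)
cells = [text_lines(td) for td in outlet_cells(root)]
for lines in cells:
    print(lines)
```

The image-only cells are skipped because they hold no `foodtitle` span, which is the same effect the XPath predicate has in the spider.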
Some of these text nodes are whitespace-only strings, so strip them, then process the remaining lines in a loop, watching for the "opening hours" and "telephone" keywords:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import re
    from todo.items import wendyitem


    class wendyspider(BaseSpider):
        name = "wendyspider"
        allowed_domains = ["wendys.com.sg"]
        start_urls = ["http://www.wendys.com.sg/outlets.php"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cells = hxs.select('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodtitle"]]')
            items = []
            for cell in cells:
                item = wendyitem()

                # get all text nodes
                # some lines are blank, so .strip() them
                lines = cell.select('.//text()').extract()
                lines = [l.strip() for l in lines if l.strip()]

                # the first non-blank line is the place name
                item['name'] = lines.pop(0)

                # for the other lines, check for "opening hours" and "telephone"
                # to store each line in the correct list container
                address_lines = []
                hours_lines = []
                telephone_lines = []

                opening_hours = False
                telephone = False
                for line in lines:
                    if 'opening hours' in line:
                        opening_hours = True
                    elif 'telephone' in line:
                        telephone = True
                    if telephone:
                        telephone_lines.append(line)
                    elif opening_hours:
                        hours_lines.append(line)
                    else:
                        address_lines.append(line)

                # the last address line is the postal code + town name
                item['address'] = "\n".join(address_lines[:-1])
                item['postal'] = address_lines[-1]

                # omit "opening hours" (first element in the list)
                item['hours'] = "\n".join(hours_lines[1:])

                item['contact'] = "\n".join(telephone_lines)

                items.append(item)
            return items
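The line-splitting part does not depend on Scrapy, so it can be pulled out and checked on its own. The sketch below is one possible way to do that, not part of the spider above: `split_outlet_lines` is a hypothetical helper name, it returns a plain dict instead of an item, and it matches the keywords case-insensitively, which is my assumption about what is safest for the live page.

```python
def split_outlet_lines(raw_lines):
    """Split one cell's text nodes into name, address, postal, hours, contact."""
    # Drop whitespace-only nodes and trim the rest.
    lines = [l.strip() for l in raw_lines if l.strip()]
    if not lines:
        return None

    # The first non-blank line is the place name.
    name = lines.pop(0)

    sections = {"address": [], "hours": [], "contact": []}
    current = "address"
    for line in lines:
        lowered = line.lower()
        if lowered.startswith("opening hours"):
            current = "hours"
            continue  # skip the heading itself
        if lowered.startswith("telephone"):
            current = "contact"
        sections[current].append(line)

    address = sections["address"]
    return {
        "name": name,
        # Everything but the last address line is the street address.
        "address": "\n".join(address[:-1]),
        # The last address line is the postal code + town name.
        "postal": address[-1] if address else "",
        "hours": "\n".join(sections["hours"]),
        "contact": "\n".join(sections["contact"]),
    }


# Text nodes of the "liat towers" cell, as extracted above.
sample = ['\n ', 'liat towers', '\n 541 liat towers #01-01', '\n orchard road',
          '\n singapore 238888', 'opening hours: ', '\n 24 hours (daily)',
          '\n telephone: 6737 8036']
outlet = split_outlet_lines(sample)
print(outlet)
```

Inside the spider you would call it with `cell.select('.//text()').extract()` and copy the dict values into the item fields.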