Python - Scrapy: how to separate text within an HTML tag element

The HTML containing the data:

    <div id="content"><!-- instancebegineditable name="editregion3" -->
      <div id="content_div">
        <div class="title" id="content_title_div"><img src="img/banner_outlets.jpg" width="920" height="157" alt="outlets" /></div>
        <div id="menu_list">
          <table border="0" cellpadding="5" cellspacing="5" width="100%">
            <tbody>
              <tr>
                <td valign="top">
                  <p>
                    <span class="foodtitle">century square</span><br />
                    2 tampines central 5<br />
                    #01-44-47 century square<br />
                    singapore 529509</p>
                  <p>
                    <br />
                    <strong>opening hours:</strong><br />
                    7am 12am (sun-thu &amp;&nbsp;ph)<br />
                    24 hours (fri &amp; sat&nbsp;&amp;</p>
                  <p>
                    eve of ph)<br />
                    telephone: 6789 0457</p>
                </td>
                <td valign="top">
                  <img alt="century square" src="/assets/images/outlets/century_sq.jpg" style="width: 260px; height: 140px" /></td>
                <td valign="top">
                  <span class="foodtitle">liat towers</span><br />
                  541 liat towers #01-01<br />
                  orchard road<br />
                  singapore 238888<br />
                  <br />
                  <strong>opening hours: </strong><br />
                  24 hours (daily)<br />
                  <br />
                  telephone: 6737 8036</td>
                <td valign="top">
                  <img alt="liat towers" src="/assets/images/outlets/century_liat.jpg" style="width: 260px; height: 140px" /></td>
              </tr>

**I want to get:**

place name: century square, liat towers

address: 2 tampines central 5, 541 liat towers #01-01

postal code: singapore 529509, singapore 238888

opening hours: 7am-12am, 24 hours (daily)

For example, the first <p> in <td valign="top"> contains three of the values I want (name, address, postal code). How do I split them?

Here is my spider code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import re
    from todo.items import WendyItem

    class WendySpider(BaseSpider):
        name = "wendyspider"
        allowed_domains = ["wendys.com.sg"]
        start_urls = ["http://www.wendys.com.sg/outlets.php"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            values = hxs.select('//td')
            items = []
            for value in values:
                item = WendyItem()
                item['name'] = value.select('//span[@class="foodtitle"]/text()').extract()
                # the XPath expressions for the remaining fields are what I am missing
                item['address'] = value.select().extract()
                item['postal'] = value.select().extract()
                item['hours'] = value.select().extract()
                item['contact'] = value.select().extract()
                items.append(item)
            return items

I would select the <td valign="top"> cells that contain a <span class="foodtitle">:

//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodtitle"]]

and then, for each of these td cells, get all the text nodes:

.//text()

That gives you something like this:

    ['\n                ',
     '\n                    ',
     'century square',
     '\n                    2 tampines central 5',
     '\n                    #01-44-47 century square',
     '\n                    singapore 529509',
     '\n                ',
     '\n                    ',
     'opening hours:',
     u'\n                    7am 12am (sun-thu &\xa0ph)',
     u'\n                    24 hours (fri & sat\xa0&',
     '\n                ',
     '\n                    eve of ph)',
     '\n                    telephone: 6789 0457',
     '\n            ']

and

    ['\n                ',
     'liat towers',
     '\n                541 liat towers #01-01',
     '\n                orchard road',
     '\n                singapore 238888',
     'opening hours: ',
     '\n                24 hours (daily)',
     '\n                telephone: 6737 8036']
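If you want to try these two steps without a Scrapy project, here is a minimal sketch that uses only the standard library. It swaps Scrapy's selectors for `xml.etree.ElementTree`, whose limited XPath support is enough for this case, with `itertext()` standing in for `.//text()`. The `SAMPLE` string is a trimmed, well-formed copy of the second outlet cell made up for illustration, so the exact whitespace will differ from what the real page returns.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the outlets table (illustrative only).
SAMPLE = """
<div id="menu_list">
  <table>
    <tr>
      <td valign="top">
        <span class="foodtitle">liat towers</span><br />
        541 liat towers #01-01<br />
        orchard road<br />
        singapore 238888<br />
        <br />
        <strong>opening hours: </strong><br />
        24 hours (daily)<br />
        <br />
        telephone: 6737 8036</td>
      <td valign="top">
        <img alt="liat towers" src="/assets/images/outlets/century_liat.jpg" /></td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# Keep only the cells that contain a span with class "foodtitle";
# the image-only cell is filtered out.
cells = [
    td for td in root.findall(".//td[@valign='top']")
    if td.find(".//span[@class='foodtitle']") is not None
]

# itertext() yields every text node under the cell, in document order.
text_nodes = [list(cell.itertext()) for cell in cells]

# Drop whitespace-only nodes and trim the rest.
lines = [t.strip() for t in text_nodes[0] if t.strip()]
print(lines)
```

The real page is HTML rather than XML (it contains `&nbsp;`, for instance), so this stand-in is only for experimenting with the selection logic; the spider itself should keep using Scrapy's selectors.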

Some of these text nodes are whitespace-only strings, so strip them, then look for the "opening hours" and "telephone" keywords as you process the lines in a loop:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import re
    from todo.items import WendyItem

    class WendySpider(BaseSpider):
        name = "wendyspider"
        allowed_domains = ["wendys.com.sg"]
        start_urls = ["http://www.wendys.com.sg/outlets.php"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cells = hxs.select('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodtitle"]]')
            items = []
            for cell in cells:
                item = WendyItem()

                # get all text nodes
                # some lines are blank, so .strip() them
                lines = cell.select('.//text()').extract()
                lines = [l.strip() for l in lines if l.strip()]

                # the first non-blank line is the place name
                item['name'] = lines.pop(0)

                # for the other lines, check for "opening hours" and "telephone"
                # and store each line in the correct list container
                address_lines = []
                hours_lines = []
                telephone_lines = []

                opening_hours = False
                telephone = False

                for line in lines:
                    if 'opening hours' in line:
                        opening_hours = True
                    elif 'telephone' in line:
                        telephone = True
                    if telephone:
                        telephone_lines.append(line)
                    elif opening_hours:
                        hours_lines.append(line)
                    else:
                        address_lines.append(line)

                # the last address line is the postal code + town name
                item['address'] = "\n".join(address_lines[:-1])
                item['postal'] = address_lines[-1]

                # omit "opening hours" (first element in the list)
                item['hours'] = "\n".join(hours_lines[1:])

                item['contact'] = "\n".join(telephone_lines)

                items.append(item)

            return items
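The line-sorting loop does not depend on Scrapy, so it can be pulled out and checked on its own against the text nodes listed earlier. The following is a minimal sketch of that idea, not part of the spider above: the function name `split_outlet_lines` and the returned dict are hypothetical, the heading line is skipped inside the loop instead of being sliced off afterwards, and replacing non-breaking spaces (`\xa0`) with plain spaces is an added assumption about how you want the hours to look.

```python
def split_outlet_lines(text_nodes):
    """Split one cell's text nodes into the fields used by the spider.

    Hypothetical helper: the name and the returned dict are illustrative.
    """
    # Assumption: non-breaking spaces are replaced with plain spaces.
    cleaned = [t.replace(u'\xa0', u' ').strip() for t in text_nodes]
    lines = [l for l in cleaned if l]

    # The first non-blank line is the place name.
    name = lines.pop(0)

    address_lines, hours_lines, telephone_lines = [], [], []
    current = address_lines
    for line in lines:
        if 'opening hours' in line:
            current = hours_lines
            continue  # skip the "opening hours" heading itself
        if 'telephone' in line:
            current = telephone_lines
        current.append(line)

    return {
        'name': name,
        # The last address line is the postal code + town name.
        'address': "\n".join(address_lines[:-1]),
        'postal': address_lines[-1],
        'hours': "\n".join(hours_lines),
        'contact': "\n".join(telephone_lines),
    }


# Text nodes of the second cell, as listed earlier.
liat_nodes = [
    '\n                ',
    'liat towers',
    '\n                541 liat towers #01-01',
    '\n                orchard road',
    '\n                singapore 238888',
    'opening hours: ',
    '\n                24 hours (daily)',
    '\n                telephone: 6737 8036',
]

item = split_outlet_lines(liat_nodes)
print(item)
```

Keeping the parsing in a plain function like this makes it easy to run against a few saved cells before wiring it into `parse`. Note that it still assumes every cell has at least one address line; a cell without one would raise an `IndexError` on `address_lines[-1]`, just as the spider would.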
