python - Missing parts on Beautiful Soup results

I am trying to retrieve a few <p> tags from the following HTML code. Here is part of it:

    <td class="eelantext">
        <a class="fblacklink"></a>
        <center></center>
        <span> … </span><br></br>
        <table width="402" vspace="5" cellspacing="0" cellpadding="3"
               border="0" bgcolor="#ffffff" align="left">
        <tbody> … </tbody></table>
        <!--edstart-->
        <p> … </p>
        <p> … </p>
        <p> … </p>
        <p> … </p>
        <p> … </p>
    </td>

You can find the webpage here.

My Python code is the following:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page)  # no parser specified
    div = soup.find('td', attrs={'class': 'eelantext'})
    print(div)
    text = div.find_all('p')  # comes back empty

But the text variable is empty, and if I print the div variable, I get the same HTML as above except for the <p> tags.

BeautifulSoup can use different parsers to handle HTML input. The HTML input here is a little broken, and the default HTMLParser parser doesn't handle it very well.

Use the html5lib parser instead:

    >>> len(BeautifulSoup(r.text, 'html').find('td', attrs={'class': 'eelantext'}).find_all('p'))
    0
    >>> len(BeautifulSoup(r.text, 'lxml').find('td', attrs={'class': 'eelantext'}).find_all('p'))
    0
    >>> len(BeautifulSoup(r.text, 'html5lib').find('td', attrs={'class': 'eelantext'}).find_all('p'))
    22
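If you want to repeat that comparison on your own page, something like the following sketch may help. It assumes bs4 is installed; lxml and html5lib are optional and are reported as None when missing. The helper name count_paragraphs and the SAMPLE_HTML snippet are made up for illustration. The snippet is well-formed, so it will not reproduce the missing <p> tags by itself; pass in the real page source to see how the parsers differ.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup; the real page is not reproduced here.
SAMPLE_HTML = """
<table><tr><td class="eelantext">
<p>first</p>
<p>second</p>
</td></tr></table>
"""


def count_paragraphs(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <p> tags inside td.eelantext}.

    A parser that is not installed maps to None instead of raising.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            counts[name] = None  # parser library not installed
            continue
        cell = soup.find("td", attrs={"class": "eelantext"})
        # find() returns None when the cell is missing from the parsed tree
        counts[name] = len(cell.find_all("p")) if cell is not None else 0
    return counts


if __name__ == "__main__":
    for parser, count in count_paragraphs(SAMPLE_HTML).items():
        print(parser, count)
```

Naming the parser explicitly in every BeautifulSoup(...) call also keeps the result from depending on which parser libraries happen to be installed on a given machine.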
