python - Missing parts on Beautiful Soup results
I am trying to retrieve a few <p> tags from the following HTML code. Here is part of it:
<td class="eelantext">
  <a class="fblacklink"></a>
  <center></center>
  <span> … </span><br></br>
  <table width="402" vspace="5" cellspacing="0" cellpadding="3" border="0" bgcolor="#ffffff" align="left">
    <tbody> … </tbody>
  </table>
  <!--edstart-->
  <p> … </p>
  <p> … </p>
  <p> … </p>
  <p> … </p>
  <p> … </p>
</td>
You can find the webpage here.
My Python code is the following:
from bs4 import BeautifulSoup

# No parser is named here, so BeautifulSoup picks its default.
soup = BeautifulSoup(page)
div = soup.find('td', attrs={'class': 'eelantext'})
print(div)
text = div.find_all('p')
But the text variable is empty, and if I print the div variable, I get the same HTML as above except for the <p> tags.
BeautifulSoup can use different parsers to handle HTML input. The HTML input here is a little broken, and the default HTMLParser parser doesn't handle it well.
Use the html5lib parser instead:
>>> len(BeautifulSoup(r.text, 'html').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'lxml').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'html5lib').find('td', attrs={'class': 'eelantext'}).find_all('p'))
22
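If you want to check for yourself which parser copes with a given page, the comparison above can be wrapped in a small helper. This is only a sketch: the function name count_paragraphs and the sample_html string are made up for illustration, and the sample is well-formed, so it will not reproduce the breakage on the real page. Parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_paragraphs(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <p> tags inside td.eelantext}.

    Hypothetical helper: a parser that is not installed maps to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[name] = None
            continue
        cell = soup.find("td", attrs={"class": "eelantext"})
        # A parser may drop the cell entirely on badly broken input.
        counts[name] = len(cell.find_all("p")) if cell is not None else 0
    return counts


# Illustrative input only; substitute the downloaded page text here.
sample_html = (
    '<table><tr><td class="eelantext">'
    "<p>first</p><p>second</p>"
    "</td></tr></table>"
)
print(count_paragraphs(sample_html))
```

Running it on the real page text (r.text in the session above) shows at a glance which parsers keep the <p> tags inside the cell, so you can pick one that does.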