python - Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds
I noticed an annoying bug: BeautifulSoup 4 (package: bs4) finds fewer tags than the previous version (package: BeautifulSoup).
Here's a reproducible instance of the issue:
    import requests
    import bs4
    import BeautifulSoup

    # Fetch the page once and parse the same text with both versions.
    r = requests.get('http://wordpress.org/download/release-archive/')
    s4 = bs4.BeautifulSoup(r.text)
    s3 = BeautifulSoup.BeautifulSoup(r.text)

    # Count the <a> tags each version finds.
    print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
    print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))
Output:
    With BeautifulSoup 4 : 557
    With BeautifulSoup 3 : 1701
The difference is not minor, as you can see.
Here are the exact versions of the modules, in case anyone is wondering:
    In [20]: bs4.__version__
    Out[20]: '4.2.1'
    In [21]: BeautifulSoup.__version__
    Out[21]: '3.2.1'
You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.
You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may be to blame here. You may have to upgrade those instead, or as well. See the lxml requirements page; libxml2 2.7.8 or newer is recommended.
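To see which versions are actually in play before upgrading anything, something like the following may help. It is only a sketch: the helper name lxml_versions is made up for illustration, and it simply returns None when lxml is not importable.

```python
def lxml_versions():
    """Return lxml and underlying C library versions, or None if lxml is missing.

    Hypothetical helper for illustration; not part of lxml or Beautiful Soup.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    # Each value is a tuple of integers, e.g. (major, minor, patch, ...).
    return {
        'lxml': etree.LXML_VERSION,
        'libxml2 (runtime)': etree.LIBXML_VERSION,
        'libxml2 (compiled)': etree.LIBXML_COMPILED_VERSION,
        'libxslt (runtime)': etree.LIBXSLT_VERSION,
    }


if __name__ == '__main__':
    versions = lxml_versions()
    if versions is None:
        print('lxml is not installed')
    else:
        for name, version in sorted(versions.items()):
            print('{}: {}'.format(name, '.'.join(str(part) for part in version)))
```

If the runtime libxml2 version is older than the recommended one, that would be the first thing to suspect.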
Alternatively, explicitly specify the other parser when parsing the soup:
    s4 = bs4.BeautifulSoup(r.text, 'html.parser')
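To check whether the parser really is the cause, it can help to run the same markup through every parser that happens to be installed and compare the counts. The sketch below assumes bs4 is installed; the count_links helper and the SAMPLE_HTML string are made up for illustration, and in practice you would pass r.text from the question instead.

```python
import bs4

# Hypothetical sample markup; the question fetched a live page instead.
SAMPLE_HTML = (
    '<html><body>'
    '<a href="/a">a</a>'
    '<p><a href="/b">b</a></p>'
    '<a href="/c">c</a>'
    '</body></html>'
)


def count_links(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Return {parser name: number of <a> tags}, skipping unavailable parsers."""
    counts = {}
    for name in parsers:
        try:
            soup = bs4.BeautifulSoup(html, name)
        except ValueError:
            # bs4 signals a missing parser with an error derived from
            # ValueError, so that parser is skipped here.
            continue
        counts[name] = len(soup.find_all('a'))
    return counts


if __name__ == '__main__':
    for name, count in sorted(count_links(SAMPLE_HTML).items()):
        print('{}: {}'.format(name, count))
```

If the counts differ between parsers for your page, the parser (or the library underneath it) is dropping part of the document rather than find_all missing tags.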