python - Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds


I noticed an annoying bug: BeautifulSoup4 (package: bs4) finds fewer tags than the previous version (package: BeautifulSoup).

Here's a reproducible instance of the issue:

    import requests
    import bs4
    import BeautifulSoup

    # Parse the same page with both library versions (Python 2 syntax)
    r = requests.get('http://wordpress.org/download/release-archive/')
    s4 = bs4.BeautifulSoup(r.text)
    s3 = BeautifulSoup.BeautifulSoup(r.text)

    # Count the <a> tags each version finds
    print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
    print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

    With BeautifulSoup 4 : 557
    With BeautifulSoup 3 : 1701

The difference is not minor, as you can see.

Here are the exact versions of the modules, in case you are wondering:

    In [20]: bs4.__version__
    Out[20]: '4.2.1'

    In [21]: BeautifulSoup.__version__
    Out[21]: '3.2.1'

You have lxml installed, which means that BeautifulSoup 4 will use that parser in preference to the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page). lxml itself uses libxml2 and libxslt, which may be to blame here; you may have to upgrade those instead, or as well. See the lxml requirements page; libxml2 2.7.8 or newer is recommended.
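To see which versions are actually in play before upgrading anything, lxml reports its own version along with the libxml2 and libxslt versions it is running against. The following is only a minimal sketch, written for Python 3; the helper name `lxml_versions` is made up for illustration, and it returns None rather than failing when lxml is not installed.

```python
def lxml_versions():
    """Return lxml/libxml2/libxslt version tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (running)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (running)": etree.LIBXSLT_VERSION,
    }


versions = lxml_versions()
if versions is None:
    print("lxml is not installed, so bs4 cannot be using it")
else:
    for name, version in versions.items():
        # Each version is a tuple of integers, e.g. (3, 2, 1, 0)
        print("{}: {}".format(name, ".".join(str(part) for part in version)))
```

If the libxml2 version printed here is older than the recommended 2.7.8, that would point at the C library rather than at BeautifulSoup itself.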

Or explicitly specify the other parser when parsing the soup:

    s4 = bs4.BeautifulSoup(r.text, 'html.parser')
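To check whether the parser really is what changes the count, the same markup can be parsed once per parser and the results compared side by side. This is a rough sketch for Python 3 with bs4 installed; the `count_links` helper and the tiny `HTML` sample are invented for illustration, and in practice you would pass in `r.text` from the question instead.

```python
import bs4

# Small well-formed sample; substitute the downloaded page text in real use.
HTML = """<html><body>
<a href="/one">one</a>
<a href="/two">two</a>
<p>text <a href="/three">three</a></p>
</body></html>"""


def count_links(markup, parsers=("lxml", "html.parser")):
    """Map each parser name to the number of <a> tags it yields (None if unavailable)."""
    counts = {}
    for parser in parsers:
        try:
            soup = bs4.BeautifulSoup(markup, parser)
        except bs4.FeatureNotFound:
            # Raised by bs4 when the requested parser is not installed
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all("a"))
    return counts


print(count_links(HTML))
```

On well-formed markup like this sample the parsers should agree; a large gap on a real page suggests that one parser is giving up partway through the document.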
