Beautiful Soup (crummy.com)
226 points by memorable on June 8, 2022 | 91 comments



Great memories with this library, one of my all time favs.

Is it fast? No.

But it had a fantastic mission: extracting data from malformed HTML.

Might be less common now but back then (~10+ years ago) it was still rampant. Many if not most parsers would barf on any deviation from the standard, leaving you to hand-roll regex solutions and ugly corner cases.

BS covered a LOT of these cases without forcing you to write terrible code. It mostly just worked, with a reasonable API, and stellar, well-written, example-laden docs.


"Might be less common now"

It is less necessary now. One of the most important parts of the HTML5 standards, IMHO, is that it specifies how to parse HTML that doesn't conform to the standards in a standard way. In principle, every bag of bytes now has a standard-compliant way to parse it that every HTML5 parser should agree on. I don't use this enough to know how many edge cases the standard and/or implementations have, but it's a lot better than it used to be, and it means that every HTML5 parser has many of the capabilities that Beautiful Soup used to (nearly) uniquely have for parsing messy HTML.
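For a concrete illustration, here's a minimal sketch (assuming the html5lib package is installed alongside bs4) of feeding deliberately broken markup to an HTML5-compliant parser, which repairs it according to the spec's error-recovery rules:

    from bs4 import BeautifulSoup

    # Misnested, unclosed tags -- the HTML5 spec defines exactly how to recover
    broken = "<p>one <b>two <p>three"

    # html5lib follows those rules, so the repaired tree should match what a
    # browser would build from the same bytes
    soup = BeautifulSoup(broken, "html5lib")
    print(soup.prettify())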

I suspect Beautiful Soup played a non-trivial part in how the decision to specify such behavior came about. It proved the idea to be a very valuable one at a time when most languages lacked such a library. Basically, BS won so hard that while it wasn't necessarily directly adopted as a standard, the essence of it certainly was.


Not just HTML standards, but I've used their detwingle function because one of the sites I'm scraping has a mixture of Windows-1252 and Unicode. It was clearly stored correctly, but encoded differently depending on what page view you were looking at. For example titles in an outline were broken, but on the actual individual pages fine. Their rendering also treated multibyte characters incorrectly during truncation.
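For reference, the detwingle step looks roughly like this (a minimal sketch; raw_bytes stands in for the mixed Windows-1252/UTF-8 document):

    from bs4 import UnicodeDammit

    # raw_bytes: a document that mixes UTF-8 and Windows-1252 byte sequences
    fixed = UnicodeDammit.detwingle(raw_bytes)

    # After detwingling, the whole document can be decoded as UTF-8
    text = fixed.decode("utf-8")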


I work in finance and this thing is still indispensable for us. A LOT of financial info is only presented on a dynamically rendered HTML page that was last written in the bad old days.


Absolutely. Until you have had to deal with this kind of problem first-hand, you have no idea how much of a relief it is that it exists.

Sometimes, finding and using the right library can completely turn around a f'd project.


True.

And BS has been that library for me on at least 2 such projects.


Any poignant examples?


It's actually pretty trivial to speed up: if you have multiple documents to parse, you can use multiprocessing.

    from multiprocessing import Pool

    from bs4 import BeautifulSoup

    def parse(html):
        # Collect the text of every <p> nested directly under a <div>
        soup = BeautifulSoup(html, 'html.parser')
        return [p.text for p in soup.select('div > p')]

    if __name__ == '__main__':
        # my_html_texts is a list of HTML strings to parse
        with Pool(processes=16) as pool:
            for texts in pool.imap_unordered(parse, my_html_texts):
                for text in texts:
                    print(text)


You might want to try a different parser as well. I tried a basic performance comparison (https://gist.github.com/MercuryRising/4061368) in another comment, and html.parser was very slow compared to the lxml-xml or xml parsers for bs4:

    ==== Total trials: 100000 =====
    bs4 lxml total time: 110.9
    bs4 html.parser total time: 87.6
    bs4 lxml-xml total time: 0.5
    bs4 xml total time: 0.5
    bs4 html5lib total time: 103.6
    pq total time: 8.7
    lxml (cssselect) total time: 8.8
    lxml (xpath) total time: 5.6
    regex total time: 13.8 (doesn't find all p)


If you want fast HTML parsing in python+lxml, use html5-parser https://github.com/kovidgoyal/html5-parser


It's a pretty common case to be parsing thousands of locally-stored pages. There are only so many cores, and the task of scraping an entire site can still easily be CPU limited.


Have literally been building something with BS today. It's very much still a current library; I imagine in some areas people have moved on, but I will continue to reach for it.


> Might be less common now

That's because it's often deeply buried under more fashionable abstractions.


absolutely. i wrote this mobile allergy data thing and getting the data was mostly scraping from news websites that did terrible things with javascript to keep scrapers and ad blockers out. Beautiful Soup worked past all of that easily. Probably my favorite Python library.


I remember that I needed to do something involving performance with beautiful soup. Switching the HTML parser backend (as they mention in the docs in BS4) gave me an order of magnitude speedup...
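For anyone curious, the switch is a one-argument change (a minimal sketch, assuming the lxml package is installed; html is whatever markup you're parsing):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")  # pure-Python, slowest
    soup = BeautifulSoup(html, "lxml")         # C-backed, typically much faster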


I'm not sure why you're using the past tense; the current version is only a year old, and I will never stop using it because panning messy datasets for hidden gold is my thing.


What type of datasets?


Things with janky APIs or built on abandonware like imageboards, niche forums, dead comment communities, data dumps etc.


doing this professionally? Just curious..


Yes, I'm a subject matter person who can code rather than a coder selling into a rewarding market. Someone with better business skills than I might find it more financially rewarding.


That sounds interesting.


Same. I had to parse HTML written by eBay sellers. They were professional sellers, not webmasters.


the jQuery of the backend


I used BS to scrape rogerebert.com and post his reviews to letterboxd:

https://letterboxd.com/re2/

I copied over only the first two paragraphs of each of his reviews with a link back to the original.

The HTML is a total mess, having obviously been moved around the web a couple of times. So it required a bunch of cleanup. But that wasn't even the hard part. The hard part was getting the correct TMDB ID for the movies, because his reviews also have either no useful metadata or metadata that's wrong, like incorrect movie years, misspelled actor names, etc.

I never was able to get API access to letterboxd, but they have a CSV import feature which worked out well enough.


I had my share of gigs where we just decided to scrape the old site with BS and extract structured data from there to render a new site. It was sometimes cheaper than dealing with their ancient ad-hoc cms monstrosities.


After I managed to wrangle the review text from the HTML it still needed this sort of cleanup:

    import re

    def clean_text(text):
        text = re.sub(r"[\x7f-\x9f]", "", text)  # remove control chars
        text = re.sub(r"[\xa0\r\t]+", " ", text)  # replace with spaces
        text = re.sub(r"\n+", "\n", text)  # squash runs of newlines
        text = re.sub(r"\s+", " ", text)  # squash runs of spaces
        # Remove newlines unless they appear to be at the end of a sentence
        # or if the sentence is shorter than 80 characters.
        text = re.sub(r"([^.?!\"\)])\n", r"\1 ", text)
        text = re.sub(r"\n([^\n]{,80})\n", r"\1 ", text)
        return text.strip()


I find this to be a better version of the docs: https://beautiful-soup-4.readthedocs.io/en/latest/#

Just in case someone wants a comment-length overview of what this superbly named library is: web scraping (HTML parsing) in Python.


The crummy.com page includes several suggestions to subscribe to Tidelift.


I found lxml.html a lot easier to work with than bs4, in case that helps anyone else.

https://lxml.de/lxmlhtml.html


On the off chance you were not aware, bs4 also supports[0] getting parse events from html5lib[1], which (as its name implies) is far more likely to parse the text the same way a browser would.

0: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index....

1: https://pypi.org/project/html5lib/


BeautifulSoup is an API for multiple parsers https://beautiful-soup-4.readthedocs.io/en/latest/#installin... :

  BeautifulSoup(markup, "html.parser") 
  BeautifulSoup(markup, "lxml")
  BeautifulSoup(markup, "lxml-xml")
  BeautifulSoup(markup, "xml") 
  BeautifulSoup(markup, "html5lib")

Looks like lxml with XPath is still the fastest with Python 3.10.4, per "Pyquery, lxml, BeautifulSoup comparison" (https://gist.github.com/MercuryRising/4061368), which is fine for parsing (X)HTML(5) that validates.

(EDIT: Is xml/html5 a good format for data serialization? defusedxml ... Simdjson, Apache arrow.js)


I was curious, so I tried that performance test you linked to on my machine with the various parsers:

    ==== Total trials: 100000 =====
    bs4 lxml total time: 110.9
    bs4 html.parser total time: 87.6
    bs4 lxml-xml total time: 0.5
    bs4 xml total time: 0.5
    bs4 html5lib total time: 103.6
    pq total time: 8.7
    lxml (cssselect) total time: 8.8
    lxml (xpath) total time: 5.6
    regex total time: 13.8 (doesn't find all p)
bs4 is damn fast with the lxml-xml or xml parsers


You want a proper HTML5 parser that can handle non-valid documents. And the fastest one is https://github.com/kovidgoyal/html5-parser, over 30x faster than html5lib.
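Usage is roughly like this (a minimal sketch; the parse function returns an lxml tree by default, so the usual XPath/CSS tooling applies):

    from html5_parser import parse  # pip install html5-parser

    root = parse("<p>Unclosed <b>and misnested markup")  # lxml Element tree
    print(root.xpath("//p//text()"))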


Same here. I am unable to properly quantify it, but there was something about the soup API I did not really like.

It may have been because I learned on the xml.etree library in the Python standard library (I moved to lxml because it has the same API but is faster and knows about parent nodes) and had a hard time with the soup API.

But I think it was the way it overloaded the selectors. I did not like the way you could magically find elements. I may have to revisit it and try and figure out why and if I still do not like it.


Is Beautiful Soup still the best way to scrape the web with python?

IIRC, Beautiful Soup doesn't handle javascript, so at least for JS you're forced to use something else.

I'm also looking forward to seeing how people scrape the web once Web Assembly becomes prevalent.


For JS, I've used Selenium with a Chrome driver, then parsed the HTML with BeautifulSoup. I know nothing about web development, so this might be an outdated way, but it worked. BS was nice to deal with.
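A minimal sketch of that pattern (the URL is a placeholder; recent Selenium versions can locate a local Chrome driver on their own):

    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    driver.get("https://example.com")          # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    for link in soup.find_all("a"):
        print(link.get("href"))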


I used Puppeteer for this to great success. Very easy to set up, and you control the whole browser + access to everything you would have access to normally through Chrome dev tools.


It can be useful if it fits your case. The more recent scrapers run a whole browser with automation, which can make scraping stuff a lot easier, since JS will run, etc.


I’ve found Playwright to be a really great tool for scraping


Headless scraping is in the region of 10x slower and more resource intensive, even when carefully blocking requests such as images. It should always be a second choice.

Other than that Playwright is incredible, by far the best browser automation api.


For sure it’s a heavy approach, but if you need a full blown browser with JS, then that’s just what you’ll have to do. Use the right tool for the job.


Playwright is easily the best for browser automation. I still use requests + beautiful soup often as well.


This sounds interesting. Any resources for a beginner? I use Selenium regularly.


Playwright for Python has really good documentation: https://playwright.dev/python/

I used it for my https://shot-scraper.datasette.io/ tool, and wrote a bit about CLI-driven scraping using that tool here: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
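For comparison, a minimal Playwright scraping sketch (sync API, placeholder URL), handing the rendered HTML off to Beautiful Soup:

    from playwright.sync_api import sync_playwright
    from bs4 import BeautifulSoup

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")   # placeholder URL
        html = page.content()               # HTML after JavaScript has run
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.string)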


Thanks!


selenium stinks in comparison to playwright and puppeteer


If you don't need JS, I think it's still the best. Sure, that doesn't always work for you, but once you start needing to render JS the scraping slows down tremendously.


Years ago, I got the privilege of working at the same company with the author, Leonard Richardson. Really nice guy, super nerd and hilariously funny.


We used this in a project many suns ago and we ended up switching to libxml2, less pretty presentation, but more functional. YMMV.


bs4 introduced some very nice features over bs3, if that's what you were using, and includes the ability to use libxml2 as a parser. For very simple things though libxml2 would be a better fit.


bs4 is able to parse some malformed documents that libxml2 chokes on.

For these cases it can be useful to do the reverse, and use the BeautifulSoup HTML parser as an alternative parser backend for the lxml package: https://lxml.de/elementsoup.html
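That looks roughly like this (a minimal sketch; lxml.html.soupparser delegates parsing to Beautiful Soup and hands back lxml elements):

    from lxml.html.soupparser import fromstring

    # Markup that trips up libxml2 can often still be parsed this way
    root = fromstring("<p>some <b>messy<p>markup")
    print(root.xpath("//p"))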


I used this recently to scrape a bunch of online restaurant reviews by a guy who I really like, use regex to get the postcode from the markup, do a geocode using postcodes.io, then plot the reviews on a Google map. It took about two / three hours and felt kinda dirty in a good way. Beautiful Soup made the first part really easy.


Hmm, cool project idea but if the subject doesn't know you're doing this, it is a bit weird to stalk their online footprint.

Unless you mean the person is a professional restaurant reviewer and you like their opinion.

Oh well, good luck either way.


"Hmm, cool project idea but if the subject doesn't know you're doing this, it is a bit weird to stalk their online footprint."

Wait, which one is the real cyberlurker? ;)


Yes, they're a professional reviewer and I enjoy their reviews.



I used Beautiful Soup on one of my first successful programming projects. I still remember how easy it was, and it taught me a lot about Python.


How helpful is this when you're dealing with a website that does not degrade gracefully and insists on using JavaScript to shove things in where a static webpage would work? (For example, scraping football scores from NFL.com)


In those cases you might want to check out SeleniumBase: https://seleniumbase.io/


If you're lucky, those sites have the raw data as a server-side generated JSON payload right in the site source code markup.

For example Target is clientside, but has all the data in a `window.FOOBAR = json` variable you can fetch and parse with some substring magic. Much easier than spinning up chromedriver and some package.
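The "substring magic" might look something like this (a rough sketch: window.FOOBAR, the URL, and the regex are all illustrative, and the naive non-greedy pattern only works if the payload itself contains no "});"):

    import json
    import re

    import requests

    html = requests.get("https://example.com/product").text   # placeholder URL

    # Naive extraction of the embedded JSON literal
    match = re.search(r"window\.FOOBAR\s*=\s*(\{.*?\});", html, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        print(list(data.keys()))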


It's not useful in those cases, but usually for those JS-rendered sites you can replicate the AJAX requests that happen and get nicely formed JSON documents to parse through instead.
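A minimal sketch of that approach (the endpoint, parameters, and headers here are placeholders; the real ones come from watching the browser's network tab):

    import requests

    resp = requests.get(
        "https://example.com/api/items",          # placeholder endpoint
        params={"page": 1},
        headers={"User-Agent": "Mozilla/5.0"},    # some sites check this
    )
    items = resp.json()                            # already-structured data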


Or the data is stored in js objects within script tags in the html and can be extracted programmatically. It's getting common with SSG sites using SPA frameworks.

For example, the new Google Play Store website stores the data in AF_initDataCallback calls and can be extracted with re.findall(r"<script nonce=\"\S+\">AF_initDataCallback\((.*?)\);", html_string).


I used to do that when I was responsible for a set of web crawlers to extract public records data, but the problem is that changes happen and these sorts of things become out of date fairly quickly.

Getting this working in a headless browser driven by Selenium would probably be easier for maintainability.


Nowadays you usually have to submit HTTP headers and cookies too; that's always a fun process of elimination.


Easiest in that case is probably to use something like headless chrome. But that is also significantly more demanding in terms of resources.


The author also wrote a novel called Constellation Games which I enjoyed a lot https://constellation.crummy.com/


Also, free online, the very entertaining short story “Let us now praise awesome dinosaurs”: http://strangehorizons.com/fiction/let-us-now-praise-awesome...


I fondly remember being introduced to this library as part of the first project I worked on at my first development job. I was lucky there; it was the right challenge at the right time.


Beautiful Soup is an incredibly robust and powerful tool. However, it can sometimes be an intimidating tool for beginners (which ".find_x_y_z" method should I use again?). To that end you should check out gazpacho (with just a single "find" method): https://github.com/maxhumber/gazpacho
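A quick sketch of what using it looks like (based on the project's README; the URL is a placeholder):

    from gazpacho import get, Soup  # pip install gazpacho

    html = get("https://example.com")   # placeholder URL
    soup = Soup(html)
    links = soup.find("a")               # one find method covers it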


Would suggest checking out pyquery. It uses jQuery-like syntax. It's been around a long time and, in my opinion, it's way easier to use: https://pypi.org/project/pyquery/
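For a taste of the jQuery-like syntax, a minimal sketch:

    from pyquery import PyQuery as pq

    d = pq("<div><p>one</p><p>two</p></div>")
    print(d("div > p").text())   # CSS selection, jQuery-style chaining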


Just used this to scrape game guides from GameFAQs. Pretty easy to use, would recommend for quick projects.


For web scraping I used htmltidy (https://en.wikipedia.org/wiki/HTML_Tidy) which cleaned it sufficiently that I could run XSLT over it (gags at the memory)


I used Beautiful soup for a project recently (grabbing a series of page titles for export to CSV), it's super useful, thanks for the pointer OP.


Boy...15 years ago I was reaching for this and hpricot almost every week to do some cool scraping/parsing of some kind. I always loved the BS API.


Beautiful Soup got me my first job.


WWW::Mechanize fam where you at?


The downside is using Perl.


Does anyone know if there is a good equivalent for Go?


I've heard https://github.com/gocolly/colly#readme mentioned fondly, but I've never used it


    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

Sometimes I wish Go idioms included an iterator abstraction, it's easier to understand and less hideous than that functional callback style.


> Does anyone know if there is a good equivalent for Go

Yes: https://github.com/anaskhan96/soup

It works well.


I wrote my own HTML parser in Pascal 15 years ago. Pascal is much faster than Python.


What's new about it? Why is this post here? Beautiful Soup has been around for a long, long time.


See: https://news.ycombinator.com/newsguidelines.html

Particularly: On-Topic: Anything that good hackers would find interesting. That includes more than hacking and startups. If you had to reduce it to a sentence, the answer might be: anything that gratifies one's intellectual curiosity.


Why is this kind of post allowed on HN? It's not recent nor relevant and not specific in any meaningful way (literally linked to the homepage). I occasionally see posts just linking to Wikipedia articles as well, same sort of feel as this. At the least, OP should have to offer some sort of discussion point or tidbit from the linked content.


Personally these and the Wikipedia posts are my favorite posts on here. News is cool, but there's a lot of cool things that don't change very often, and I love seeing those things too.

Also, in response to your "low effort posts do not invite meaningful discussion" from a different comment, I don't see how this is any lower effort than every other link only post (i.e. the vast majority)? And there's over 40 comments on this thread now talking about other scrapers, projects you can do with scrapers, better docs, tangential use cases and how to handle them, etc. Seems like a lot of people have a variety of things to say about this, I don't see how that's not "meaningful discussion".

EDIT: I also disagree with requiring a couple of sentences from the submitter. If they have something to say they can say it; otherwise it's fine if they don't try to influence the discussion - it's more interesting to see where the random commenters take something than trying to chart a course.


I used to feel this way too, but then I realized that we need to welcome the newcomers (not just newcomers to HN, but also newcomers to this material, and especially young people) who haven't yet encountered these things for the first time. For us grizzled oldtimers, they may be classics or perennials, but not for everybody. So it's ok for these posts to be part of the mix.

One of my favorite things about HN is that it has a lot of both serious oldtimers and high school students. It's actually a place where one can, at least sporadically, get the technical mentorship that many of us longed for, but missed, early in our careers.

Wikipedia submissions are a special case and somewhat different: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que....


Because it's nice to exist outside the news cycle of what's "current" every once in a while :)


You have other social media for that. Low effort posts do not invite meaningful discussion and add moderation load.


Did you know that you don't have to read every article posted here, nor read every comment that is added?

If you don't want to participate in this post, it's OK to skip it. I skip dozens of posts a day - the best part is that it's more efficient than going to them and putting in the effort to whine!

As for not inviting meaningful discussion: there's some good discussion on this post - the very article you claim isn't capable of generating such.


However, I don't recall hearing the moderators complain about it.

I'm guessing most of the difficult moderation would be on the newsy more-controversial posts anyways.



