Web scraping with Beautiful Soup

Author

Marie-Hélène Burle

The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can however be extremely tedious.

Web scraping tools automate parts of that process, and Python is a popular language for the task.

In this workshop, I will guide you through a simple example using the package Beautiful Soup.

HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>

Examples:

Site structure:

  • <h2>This is a heading of level 2</h2>
  • <p>This is a paragraph</p>

Formatting:

  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>
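
To make the anatomy of a tag concrete, here is a minimal sketch that reads the tag name, content, and attribute of a link. It uses Beautiful Soup, the package covered in this workshop, on a made-up HTML fragment: the fragment and the variable names are illustrative assumptions, not part of the site we scrape later.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment with one heading, one paragraph, and one link
html = """
<h2>This is a heading of level 2</h2>
<p>This is a paragraph with <a href="https://some.url">a link</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")          # The first <a> tag in the fragment
tag_name = link.name           # The name of the tag
link_text = link.get_text()    # The content between the opening and closing tags
link_target = link["href"]     # The value of the href attribute

print(tag_name, link_text, link_target)
```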

Web scraping

Web scraping is a general term for the techniques and tools used to extract data from the web automatically.

While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.
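
As a rough sketch of what scraping responsibly can look like, the example below checks URLs against a robots.txt file and pauses between them. It relies only on the standard library. The robots.txt content, the example.org URLs, and the allowed_urls helper are hypothetical. A robots.txt file is also only one signal: it does not replace reading the terms of use of a site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at the root of a site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls, delay=1.0):
    """Return the URLs that robots.txt allows, pausing after each check."""
    allowed = []
    for u in urls:
        if parser.can_fetch("*", u):
            allowed.append(u)    # A real scraper would download the page here
        time.sleep(delay)        # Wait before moving on to the next URL
    return allowed


# Hypothetical URLs used for illustration only
candidate_urls = [
    "https://example.org/books/",
    "https://example.org/private/data",
]

print(allowed_urls(candidate_urls, delay=0.1))
```

The right delay depends on the site. One second is only a placeholder default.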

Example for this workshop

We will use a website from the University of Pennsylvania that gives access to three million free online books.

Our goal is to get a list of books by Proust available in that database. We can go to the online books author search page and enter “Proust” in the search box. This shows us the data we are interested in. From here, you could copy and paste the data one line at a time, but that would be rather tedious. Instead, we will automate this task.

Load packages

Let’s load the packages that will make scraping websites with Python easier:

import requests                 # To download the HTML data from a site
from bs4 import BeautifulSoup   # To parse the HTML data
import time                     # To add a delay between requests (part of the standard library)
import polars as pl             # To store our data in a DataFrame

Let’s create a string with the URL we want to scrape:

url = "https://onlinebooks.library.upenn.edu/webbin/book/search?author=proust&amode=words"
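
The part of the URL after the question mark is a query string: it holds the search parameters that the form sent. The sketch below takes the URL apart and rebuilds it with the standard library. The search_url helper is hypothetical, and whether the site accepts other author names in exactly this form is an assumption.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://onlinebooks.library.upenn.edu/webbin/book/search?author=proust&amode=words"

parts = urlsplit(url)
params = parse_qs(parts.query)   # {'author': ['proust'], 'amode': ['words']}


def search_url(author):
    """Build a search URL for an author, reusing the pattern of the URL above."""
    query = urlencode({"author": author, "amode": "words"})
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"


print(params)
print(search_url("proust"))
```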

First, we send a request to that URL and save the response in a variable called r:

r = requests.get(url)

Let’s see what our response looks like:

r
<Response [200]>

If you look in the list of HTTP status codes, you can see that a response with a code of 200 means that the request was successful.
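
You can also look up status codes without leaving Python. The sketch below uses the HTTPStatus enumeration from the standard library. The describe_status and is_success helpers are hypothetical names chosen for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code followed by its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Codes from 200 to 299 indicate a successful request."""
    return 200 <= code < 300


print(describe_status(200))   # 200 OK
print(describe_status(404))   # 404 Not Found
print(is_success(200), is_success(404))
```

In a script, it is usually safer to check the code than to read it by eye. With requests, r.raise_for_status() raises an exception when the response reports an error.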

Explore the raw data

To get the actual content of the response as unicode (text), we can use the text property of the response. This will give us the raw HTML markup from the webpage.

Let’s print the first 400 characters:

print(r.text[:400])
<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8">
<link rel="stylesheet" type="text/css" href="https://onlinebooks.library.upenn.edu/olbp.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Search Results | The Online Books Page</title></head>
<body>
<header>
<h1><a href="https://onlinebooks.library.upenn.edu/" class="logolink">The Online Books Page</a

Parse the data

The package Beautiful Soup transforms (parses) such HTML data into a parse tree, which will make extracting information easier.

Let’s create an object called data with the parse tree:

data = BeautifulSoup(r.text, "html.parser")

html.parser is the name of the parser that we are using here. It is better to use a specific parser to get consistent results across environments.

We can print the beginning of the parsed result:

print(data.prettify()[:400])
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://onlinebooks.library.upenn.edu/olbp.css" rel="stylesheet" type="text/css"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Search Results | The Online Books Page
  </title>
 </head>
 <body>
  <header>
   <h1>
    <a class="logolink" href="https://onlinebooks.library.upenn

The prettify method turns the BeautifulSoup object we created into a string (which is needed for slicing).

It doesn’t look any clearer to us, but it is now in a format the Beautiful Soup package can work with.

For instance, we can get the HTML segment containing the title of the page with three methods:

  • using the title tag name:
data.title
<title>Search Results | The Online Books Page</title>
  • using find to look for HTML markers (tags, attributes, etc.):
data.find("title")
<title>Search Results | The Online Books Page</title>
  • using select which accepts CSS selectors:
data.select("title")
[<title>Search Results | The Online Books Page</title>]

find will only return the first element. find_all will return all elements. select will also return all elements. Which one you choose depends on what you need to extract. There are often several ways to get there.
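
The differences between these methods are easier to see on a small input. The sketch below runs them on a made-up HTML fragment, so the fragment and the variable names are assumptions for illustration. It also shows select_one, the CSS selector counterpart of find.

```python
from bs4 import BeautifulSoup

# A made-up fragment with two titles
html = """
<ul>
  <li><a href="/a"><cite>First Title</cite></a></li>
  <li><a href="/b"><cite>Second Title</cite></a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("cite")           # A single element: the first match
all_found = soup.find_all("cite")   # A list-like ResultSet with every match
selected = soup.select("li cite")   # CSS selector: every <cite> inside an <li>
one = soup.select_one("cite")       # Like find, but with a CSS selector

missing = soup.find("table")        # None when nothing matches
no_match = soup.select("table")     # An empty result when nothing matches

print(first.get_text(), len(all_found), len(selected), missing, len(no_match))
```

Because find returns None when there is no match, test the result before calling methods on it.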

Below are other examples of data extraction.

The beginning of the parsed data:

data.head
<head>
<meta charset="utf-8"/>
<link href="https://onlinebooks.library.upenn.edu/olbp.css" rel="stylesheet" type="text/css"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Search Results | The Online Books Page</title></head>

The first link in the data:

data.a
<a class="logolink" href="https://onlinebooks.library.upenn.edu/">The Online Books Page</a>

The first 5 links (returned as a list):

data.find_all("a")[:5]
[<a class="logolink" href="https://onlinebooks.library.upenn.edu/">The Online Books Page</a>,
 <a href="https://onlinebooks.library.upenn.edu/webbin/book/browse?type=lcsubc&amp;key=Proust%2c%20Marcel%2c%201871%2d1922">Online books about this author</a>,
 <a href="https://en.wikipedia.org/wiki/Marcel_Proust">Wikipedia article</a>,
 <a href="https://onlinebooks.library.upenn.edu/webbin/book/lookupid?key=olbp26606"><img alt="[Info]" class="info" src="https://onlinebooks.library.upenn.edu/info.gif"/></a>,
 <a href="https://gutenberg.net.au/ebooks03/0300501.txt"><cite>The Captive</cite> (English translation published 1929)</a>]

or:

data.select("a")[:5]
[<a class="logolink" href="https://onlinebooks.library.upenn.edu/">The Online Books Page</a>,
 <a href="https://onlinebooks.library.upenn.edu/webbin/book/browse?type=lcsubc&amp;key=Proust%2c%20Marcel%2c%201871%2d1922">Online books about this author</a>,
 <a href="https://en.wikipedia.org/wiki/Marcel_Proust">Wikipedia article</a>,
 <a href="https://onlinebooks.library.upenn.edu/webbin/book/lookupid?key=olbp26606"><img alt="[Info]" class="info" src="https://onlinebooks.library.upenn.edu/info.gif"/></a>,
 <a href="https://gutenberg.net.au/ebooks03/0300501.txt"><cite>The Captive</cite> (English translation published 1929)</a>]

Identify relevant markers

The HTML code for this webpage contains the data we are interested in, but it is mixed in with a lot of HTML formatting and data we don’t care about. We need to extract the data relevant to us and turn it into a workable format.

The first step is to find the HTML markers that contain our data. One option is to use a web inspector or—even easier—the SelectorGadget, a JavaScript bookmarklet built by Andrew Cantino.

To use this tool, go to the SelectorGadget website and drag the link of the bookmarklet to your bookmarks bar.

Now, go to the search result page and click on the bookmarklet in your bookmarks bar. You will see a floating box at the bottom of your screen. As you move your mouse across the screen, an orange rectangle appears around each element you pass over, and the box gives you the HTML tag of that element.

Click on one of the titles by Proust: now, cite appears in the box at the bottom as well as the number of elements selected. The selected elements are highlighted in yellow.

Now you know that there are 125 titles (the number shown in the SelectorGadget box).

Extract the first title

It is a good idea to test things out on a single element before doing a massive batch scraping of a site, so let’s test our method for the first title.

type(data.select("cite"))
bs4.element.ResultSet
len(data.select("cite"))
125

To get individual titles, we index the result (the first one is at index 0):

data.select("cite")[0]
<cite>The Captive</cite>
data.select("cite")[1]
<cite>Cities of the Plain</cite>
data.select("cite")[3]
<cite>Sodome et Gomorrhe</cite>
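
The indexed results are still HTML elements, tags included. One possible way to keep only the text is the get_text method, sketched below on a made-up fragment. The fragment mimics the structure of the links printed earlier, but the example.org URLs are placeholders and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# A made-up fragment modelled on the structure of the search results
html = """
<li><a href="https://example.org/1"><cite>The Captive</cite> (English translation published 1929)</a></li>
<li><a href="https://example.org/2"><cite>Cities of the Plain</cite></a></li>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.select("cite")[0]    # Still a tag: <cite>The Captive</cite>
first_title = first.get_text()    # Only the text inside the tag

# Once it works for one element, the same call can be applied to all of them
titles = [c.get_text() for c in soup.select("cite")]

print(first_title)
print(titles)
```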