How to Scrape Websites with Python 3
Last updated: May 17, 2020
Web scraping is the process of extracting data from websites.

At the end of this article, you will know how to download a webpage, parse it for interesting information, and format it in a usable format for further processing. This is also known as ETL (Extract, Transform, Load).
Before attempting to scrape a website, you should make sure that the provider allows it in their terms of service. You should also check to see whether you could use an API instead.
Massive scraping can put a server under a lot of stress, which can result in a denial of service. And you don't want that.
This article is for advanced readers. It will assume that you are already familiar with the Python programming language.
At the very minimum you should understand list comprehensions, context managers, and functions. You should also know how to set up a virtual environment.
We'll run the code on your local machine to explore some websites. With some tweaks you could make it run on a server as well.
This article will also explain what to do if that website is using JavaScript to render content (like React.js or Angular).
Before I can start, I want to make sure we're ready to go. Please set up a virtual environment and install the following packages into it:
beautifulsoup4 (version 4.9.0 at time of writing)
requests (version 2.23.0 at time of writing)
wordcloud (version 1.17.0 at time of writing, optional)
selenium (version 3.141.0 at time of writing, optional)
For this example, we are going to scrape the German Basic Law (Grundgesetz) from gesetze-im-internet.de. (Don't worry, I checked their Terms of Service. They offer an XML version for machine processing, but this page serves as an example of processing HTML. So it should be fine.)

First things first: I create a file urls.txt holding all the URLs I want to download:
urls.txt
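The original list of URLs is not preserved in this copy. Based on the example URL used later in the article, it could look something like this (the exact articles listed here are only illustrative):

```
https://www.gesetze-im-internet.de/gg/art_1.html
https://www.gesetze-im-internet.de/gg/art_2.html
https://www.gesetze-im-internet.de/gg/art_3.html
```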
Next, I write a bit of Python code in a file called scraper.py to download the HTML of these files.
In a real scenario, this would be too expensive and you'd use a database instead. To keep things simple, I'll download the files into the same directory next to the script and use their names as the filenames.
scraper.py
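The original code block is missing here, so the following is only a minimal sketch of what such a downloader could look like, using requests and deriving each filename from the last path segment of the URL (the structure and names are my own):

```python
from pathlib import Path

import requests


def main():
    # Read all URLs from urls.txt, skipping empty lines.
    urls = [
        line.strip()
        for line in Path("urls.txt").read_text().splitlines()
        if line.strip()
    ]

    for url in urls:
        response = requests.get(url)
        response.raise_for_status()
        # Use the last part of the URL (e.g. "art_1.html") as the filename.
        filename = url.rstrip("/").rsplit("/", 1)[-1]
        Path(filename).write_text(response.text, encoding="utf-8")
        print(f"Downloaded {url} -> {filename}")


if __name__ == "__main__":
    main()
```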
By downloading the files, I can process them locally as much as I want without being dependent on a server. Try to be a good web citizen, okay?
Now that I've downloaded the files, it's time to extract their interesting features. To do so, I go to one of the pages I downloaded, open it in a web browser, and hit Ctrl-U to view its source. Inspecting it will show me the HTML structure.

In my case, I figured I want the text of the law without any markup. The element wrapping it has an id of container. Using BeautifulSoup, I can see that a combination of looking up that element and [get_text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text) will do what I want.

Since I have a second step now, I'm going to refactor the code a bit by putting it into functions and adding a minimal CLI.
scraper.py
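Again, the original listing is not preserved, so here is a sketch of how the refactored script might look. The container id and get_text come from the article; the function names, the .txt output files, and the exact argument handling are my assumptions:

```python
import sys
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def download(url: str) -> Path:
    """Download a single URL and save it under its last path segment."""
    response = requests.get(url)
    response.raise_for_status()
    path = Path(url.rstrip("/").rsplit("/", 1)[-1])
    path.write_text(response.text, encoding="utf-8")
    return path


def parse(path: Path) -> str:
    """Extract the law text from the element with the id 'container'."""
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    text = soup.find(id="container").get_text()
    # Save the extracted text next to the HTML file, e.g. art_1.txt.
    path.with_suffix(".txt").write_text(text, encoding="utf-8")
    return text


def main():
    if len(sys.argv) == 1:
        # No arguments: download and extract everything listed in urls.txt.
        urls = [u.strip() for u in Path("urls.txt").read_text().splitlines() if u.strip()]
        for url in urls:
            parse(download(url))
    elif sys.argv[1] == "download":
        download(sys.argv[2])
    elif sys.argv[1] == "parse":
        parse(Path(sys.argv[2]))
    else:
        print("usage: python scraper.py [download <url> | parse <file>]")


if __name__ == "__main__":
    main()
```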
Now I can run the code in three ways:
Without any arguments to run everything (that is, download all URLs and extract them, then save to disk) via: python scraper.py
With an argument of download and a URL to download: python scraper.py download https://www.gesetze-im-internet.de/gg/art_1.html. This will not process the file.
With an argument of parse and a filepath to parse: python scraper.py parse art_1.html. This will skip the download step.
With that, there's one last thing missing.
Let's say I want to generate a word cloud for each article. This can be a quick way to get an idea about what a text is about. For this, install the package wordcloud and update the file like this:
scraper.py
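The updated listing is also missing from this copy. A minimal sketch of the added word-cloud step could look like this; the stopword filename and the function name are assumptions on my part, and the function would be called from the parse step with the article's basename:

```python
from pathlib import Path

from wordcloud import WordCloud


def generate_word_cloud(text: str, basename: str) -> None:
    """Render a word cloud for the given text and save it as <basename>.png."""
    # A plain-text stopword list downloaded from GitHub, one word per line
    # (the filename is an assumption).
    stopwords = set(Path("stopwords-de.txt").read_text(encoding="utf-8").split())

    cloud = WordCloud(stopwords=stopwords, width=800, height=600).generate(text)
    cloud.to_file(f"{basename}.png")
```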
What changed? For one, I downloaded a list of stopwords from GitHub. This way, I can eliminate the most common words from the downloaded law text.

Then I instantiate a WordCloud instance with the list of stopwords I downloaded and the text of the law. It will be turned into an image with the same basename.
After the first run, I discover that the list of stopwords is incomplete. So I add additional words I want to exclude from the resulting image.
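In the sketch above, that would simply mean extending the stopwords set before generating the image, for example:

```python
# Words that still dominated the image after the first run; these particular
# words are only illustrative guesses for German legal text.
stopwords.update({"Artikel", "Absatz", "Satz", "Gesetz"})
```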
With that, the main part of web scraping is complete.
SPAs - or Single Page Applications - are web applications where the whole experience is controlled by JavaScript, which is executed in the browser. As such, downloading the HTML file does not get us far. What should we do instead?
We'll use the Firefox browser with Selenium. Make sure to also install geckodriver: download the .tar.gz archive and unpack it in the bin folder of your virtual environment so it will be found by Selenium. That is the directory where you can find the activate script (on GNU/Linux systems).

As an example, I am using the Angular homepage here. Angular is a popular SPA framework written in JavaScript and guaranteed to be controlled by it for the time being.

Since the code will be slower, I create a new file called crawler.py for it. The content looks like this:
crawler.py
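The crawler code itself is not preserved either. A rough sketch under the assumptions above (Firefox with geckodriver, the Angular site as the example URL, and an extract/transform/load split as described below) might look like this; the waiting strategy and function names are mine:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from wordcloud import WordCloud


def extract(url: str) -> dict:
    """Open the page in Firefox and copy the text of its <article> element."""
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until the JavaScript application has rendered an <article> element.
        article = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
        return {"url": url, "text": article.text}
    finally:
        driver.quit()


def transform(data: dict) -> str:
    # Read the text back out of the dictionary.
    return data["text"]


def load(text: str, basename: str) -> None:
    # Turn the text into a word cloud image, as in the scraper above.
    WordCloud().generate(text).to_file(f"{basename}.png")


if __name__ == "__main__":
    # The URL is only an example of a JavaScript-rendered page.
    data = extract("https://angular.io")
    load(transform(data), "angular")
```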
Here, Python is opening a Firefox instance, browsing the website, and looking for an <article> element. It is copying over its text into a dictionary, which gets read out in the transform step and turned into a WordCloud during load.

When dealing with JavaScript-heavy sites, it is often useful to use explicit waits and perhaps even run [execute_script](https://selenium-python.readthedocs.io/api.html#selenium.webdriver.remote.webdriver.WebDriver.execute_script) to defer to JavaScript if needed.
Thanks for reading this far! Let's summarise what we've learned now:
How to scrape a website with Python's requests package.
How to translate it into a meaningful structure using beautifulsoup.
How to further process that structure into something you can work with.
What to do if the target page is relying on JavaScript.
You can find the code for this project in the accompanying Git repository.