Scrape Wikipedia Articles
In this article I'm going to create a web scraper in Python that will scrape Wikipedia pages.
The scraper will go to a Wikipedia page, scrape the title, and follow a random link to the next Wikipedia page.
I think it will be fun to see what random Wikipedia pages this scraper will visit!
To start, I'm going to create a new Python file called scraper.py.
To make the HTTP request, I'm going to use the requests library. You can install it with the following command:
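The requests library isn't part of the standard library, so install it with pip:

```shell
pip install requests
```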
Let's use the web scraping wiki page as our starting point:
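A first version of scraper.py can simply fetch the page and print the response status (this sketch assumes requests is installed and you have network access):

```python
import requests

# Fetch the "Web scraping" article as our starting point
response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
print(response.status_code)
```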
When running the scraper, it should display a 200 status code.
Alright, so far so good!
Let's extract the title from the HTML page. To make my life easier I'm going to use the BeautifulSoup package for this.
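Like requests, Beautiful Soup is a third-party package; it's published on PyPI as beautifulsoup4:

```shell
pip install beautifulsoup4
```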
When inspecting the Wikipedia page, I see that the title tag has the firstHeading ID, and Beautiful Soup allows you to find an element by its ID.
Bringing it all together the program now looks like this:
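A sketch of the combined program: it parses the HTML with Python's built-in html.parser and looks up the heading by its firstHeading ID:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(response.content, "html.parser")

# The article title lives in the element with the firstHeading ID
title = soup.find(id="firstHeading")
print(title.text)
```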
And when running this, it shows the title of the Wiki article.
Now I'm going to dive deep into Wikipedia. I'm going to grab a random <a> tag to another Wikipedia article and scrape that page.
To do this I will use Beautiful Soup to find all the <a> tags within the wiki article. Then I shuffle the list to make it random.
As you can see, I use soup.find(id="bodyContent").find_all("a") to find all the <a> tags within the main article.
Since I'm only interested in links to other Wikipedia articles, I make sure the link contains the /wiki prefix.
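Putting that together, a sketch of this step could look like the following; the check on link.get("href") is my own guard, since some <a> tags in the page carry no href attribute at all:

```python
import random

import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(response.content, "html.parser")

# Find every <a> tag inside the main article body, then shuffle
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)

linkToScrape = None
for link in allLinks:
    href = link.get("href")
    # We are only interested in links to other Wikipedia articles
    if href is None or not href.startswith("/wiki/"):
        continue
    linkToScrape = link
    break

print(linkToScrape["href"])
```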
When running the program now, it displays a link to another Wikipedia article. Nice!
Alright, let's make the scraper actually scrape the new link.
To do this I'm going to move everything into a scrapeWikiArticle function.
The scrapeWikiArticle function will get the wiki article, extract the title, and find a random link. Then it will call scrapeWikiArticle again with this new link. Thus, it creates an endless cycle: a scraper that bounces around on Wikipedia.
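Here's a sketch of the final program. The maxDepth parameter is my own addition so the example stops after a few hops; drop it (and the depth check) if you want the truly endless version described above:

```python
import random

import requests
from bs4 import BeautifulSoup


def scrapeWikiArticle(url, depth=0, maxDepth=3):
    response = requests.get(url=url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Print the title of the current article
    title = soup.find(id="firstHeading")
    print(title.text)

    # Stop once we've followed enough links (my addition, for a bounded demo)
    if depth >= maxDepth:
        return

    # Collect all links in the article body and shuffle them
    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)

    for link in allLinks:
        href = link.get("href")
        # Only follow links to other Wikipedia articles
        if href is None or not href.startswith("/wiki/"):
            continue
        # Scrape the newly found article, then stop looking further
        scrapeWikiArticle("https://en.wikipedia.org" + href, depth + 1, maxDepth)
        return


scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")
```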
Let's run the program and see what we get.
Awesome, in roughly 10 steps we went from "Web Scraping" to "Feminism in Brazil". Amazing!
We've built a web scraper in Python that scrapes random Wikipedia pages. It bounces around endlessly on Wikipedia by following random links.
This is a fun gimmick and Wikipedia is pretty lenient when it comes to web scraping.
There are also websites that are harder to scrape, such as Amazon or Google. If you want to scrape such a website, you should set up a system with headless Chrome browsers and proxy servers, or use a scraping service that handles all of that for you.
But be careful not to abuse websites, and only scrape data that you are allowed to scrape.
Reference : https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/