Extract Wikipedia Data

How to Extract Wikipedia Data Using Python

April 28th 2020

Note that we are not going to scrape Wikipedia pages manually; the wikipedia module has already done the hard work for us. Let's install it from PyPI by running pip3 install wikipedia.

Open up a Python interactive shell or an empty file and follow along.

Let's get the summary of what the Python programming language is:
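A minimal sketch of that call, assuming the wikipedia package is installed and the machine has network access. The helper name fetch_summary and its wiki parameter are made up for illustration (the parameter only exists so the function can be tried with a stand-in object); wikipedia.summary is the library function doing the work.

```python
def fetch_summary(query, wiki=None):
    """Return the summary text Wikipedia gives for `query`."""
    if wiki is None:
        # Third-party package; imported lazily so the helper can be
        # defined even where the package is not installed.
        import wikipedia as wiki
    return wiki.summary(query)
```

Calling print(fetch_summary("Python Programming Language")) should print the summary text.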

This extracts the summary from the matching Wikipedia page. More specifically, it returns the first few sentences; we can specify the number of sentences to extract:
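As a rough sketch, the sentence limit can be passed through the sentences keyword of wikipedia.summary. The helper name fetch_short_summary and the wiki parameter are illustrative assumptions, not part of the library.

```python
def fetch_short_summary(query, sentences=2, wiki=None):
    """Return only the first `sentences` sentences of the summary."""
    if wiki is None:
        import wikipedia as wiki  # third-party package; needs network access
    # `sentences` limits how many sentences the library returns.
    return wiki.summary(query, sentences=sentences)
```

For example, print(fetch_short_summary("Python programming languag", sentences=2)) asks for two sentences, with the query deliberately misspelled.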

Notice that I misspelled the query intentionally, yet it still gives an accurate result.

To search for a term using Wikipedia's search:
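A small sketch of the search step, built on wikipedia.search. The helper name search_titles and the wiki parameter are hypothetical conveniences.

```python
def search_titles(query, wiki=None):
    """Return the list of page titles Wikipedia's search suggests for `query`."""
    if wiki is None:
        import wikipedia as wiki  # third-party package; needs network access
    return wiki.search(query)
```

For example, result = search_titles("Neural networks") stores the list of titles in result.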

This returns a list of related page titles. Let's get the whole page for "Neural network", which is "result[0]":
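Loading the full page could look like the sketch below, which relies on wikipedia.page. The helper name load_page and the wiki parameter are assumptions made for this example.

```python
def load_page(title, wiki=None):
    """Return the page object for `title`, e.g. the first search result."""
    if wiki is None:
        import wikipedia as wiki  # third-party package; needs network access
    # The returned object exposes attributes such as title, content and links.
    return wiki.page(title)
```

With the search result from before, page = load_page(result[0]) would give the page object used in the following steps.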

Extracting the title:
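The title is available as the title attribute of the page object. A tiny sketch, where page_title is a hypothetical helper and page is assumed to come from wikipedia.page:

```python
def page_title(page):
    """Return the title of a page object obtained from wikipedia.page()."""
    return page.title
```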

Getting all the categories of that Wikipedia page:
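The categories live in the categories attribute of the page object. In this sketch, page_categories is an illustrative name and page is assumed to come from wikipedia.page:

```python
def page_categories(page):
    """Return the category names of a page as a new list."""
    # Copy into a fresh list so callers can modify it safely.
    return list(page.categories)
```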

Extracting the text after removing all HTML tags (this is done automatically):
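The plain text is exposed through the content attribute. The sketch below adds an optional max_chars cut-off, since full articles can be long; the helper name page_text and that parameter are assumptions for illustration only.

```python
def page_text(page, max_chars=None):
    """Return the plain-text content of a page, optionally truncated."""
    text = page.content  # plain text, without HTML tags
    return text if max_chars is None else text[:max_chars]
```

For a quick preview, print(page_text(page, max_chars=500)) shows only the first 500 characters.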

All links:
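The links attribute holds the titles of the Wikipedia pages linked from this page. A short sketch, with page_links as a made-up helper name:

```python
def page_links(page):
    """Return the titles of the Wikipedia pages this page links to."""
    return list(page.links)
```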

The references:
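The references attribute contains the URLs of external links found on the page. A sketch under the same assumptions as before (page comes from wikipedia.page, and page_references is a hypothetical name):

```python
def page_references(page):
    """Return the external URLs referenced by the page."""
    return list(page.references)
```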

Finally, the summary:
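The page object also carries its own summary attribute, so no second lookup by title is needed. In this sketch, page_summary is an illustrative helper name:

```python
def page_summary(page):
    """Return the summary stored on an already loaded page object."""
    return page.summary
```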

Let's print them out:
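One possible way to print everything at once is sketched below. The helper format_page_overview and its max_items and max_chars limits are assumptions added here to keep the output short; only the page attributes themselves come from the library.

```python
def format_page_overview(page, max_items=5, max_chars=300):
    """Build a short text overview of a page, truncating the long fields."""
    lines = [
        f"Title: {page.title}",
        f"Categories: {page.categories[:max_items]}",
        f"Links: {page.links[:max_items]}",
        f"References: {page.references[:max_items]}",
        f"Summary: {page.summary[:max_chars]}",
        f"Content: {page.content[:max_chars]}",
    ]
    return "\n".join(lines)
```

Then print(format_page_overview(page)) prints the overview for the loaded page.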

Try it out!

Alright, we are done. This was a brief introduction to how you can extract information from Wikipedia in Python. This can be helpful if you want to automatically collect data for language models, build a question-answering chatbot, make a wrapper application around this, and much more! The possibilities are endless; tell us what you did with this in the comments below!

If this tutorial was helpful, buy me a coffee: buymeacoff.ee/gajeshnaik

Reference: https://hackernoon.com/how-to-extract-wikipedia-data-using-python-l34l32wo
