Extract Wikipedia Data
Last updated
Note that we are not going to scrape Wikipedia pages by hand; the wikipedia module has already done the hard work for us. Let's install it by running pip3 install wikipedia in a terminal.
Open up a Python interactive shell or an empty file and follow along.
Let's get the summary of what the Python programming language is:
This extracts the summary from the corresponding Wikipedia page. More specifically, it prints the first few sentences, and we can specify how many sentences to extract:
Notice that I misspelled the query intentionally, and it still gives me an accurate result.
Now let's search Wikipedia for a term:
This returns a list of related page titles. Let's get the whole page for "Neural network", which is result[0]:
Extracting the title:
Getting all the categories of that Wikipedia page:
Extracting the text after removing all HTML tags (this is done automatically):
All links:
The references:
Finally, the summary:
Let's print them out:
Try it out!
Alright, we are done. This was a brief introduction to extracting information from Wikipedia in Python. It can be helpful if you want to automatically collect data for language models, build a question-answering chatbot, write a wrapper application around the module, and much more. The possibilities are endless, so tell us what you did with this in the comments below!
If this tutorial was helpful, buy me a coffee: buymeacoff.ee/gajeshnaik
Reference: https://hackernoon.com/how-to-extract-wikipedia-data-using-python-l34l32wo