How to Scrape Websites with Python 3

May 17, 2020


Web scraping is the process of extracting data from websites.

Before attempting to scrape a website, you should make sure that the provider allows it in their terms of service. You should also check to see whether you could use an API instead.

Massive scraping can put a server under a lot of stress, which can result in a denial of service. And you don't want that.
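Speaking of checking whether a provider allows scraping: besides the terms of service, the site's robots.txt file is a good first signal. This check isn't part of the original workflow, but a minimal sketch using Python's built-in urllib.robotparser could look like this:

from urllib.robotparser import RobotFileParser

# Sketch only: robots.txt is not a substitute for reading the terms of
# service, but it tells you what the site operator wants crawlers to do.
robots = RobotFileParser('https://www.gesetze-im-internet.de/robots.txt')
robots.read()
print(robots.can_fetch('*', 'https://www.gesetze-im-internet.de/gg/art_1.html'))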

Who should read this?

This article is for advanced readers. It will assume that you are already familiar with the Python programming language.

At the very minimum you should understand list comprehensions, context managers, and functions. You should also know how to set up a virtual environment.

We'll run the code on your local machine to explore some websites. With some tweaks you could make it run on a server as well.

What you will learn in this article

At the end of this article, you will know how to download a webpage, parse it for interesting information, and format it in a usable format for further processing. This is also known as ETL (Extract, Transform, Load).

This article will also explain what to do if the website uses JavaScript to render its content (like React.js or Angular).

You can find the code for this project in the accompanying git repository on GitHub.

Prerequisites

Before I can start, I want to make sure we're ready to go. Please set up a virtual environment and install the following packages into it:

  • beautifulsoup4 (version 4.9.0 at time of writing)

  • requests (version 2.23.0 at time of writing)

  • wordcloud (version 1.17.0 at time of writing, optional)

  • selenium (version 3.141.0 at time of writing, optional)
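If you want to make sure the two required packages are importable from inside the virtual environment, a quick sanity check (my addition, not part of the tutorial) is:

# Run inside the activated virtual environment; the exact versions
# don't have to match the ones above.
import bs4
import requests

print('beautifulsoup4', bs4.__version__)   # e.g. 4.9.0
print('requests', requests.__version__)    # e.g. 2.23.0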

Step 1: Download the source

For this example, we are going to scrape the Basic Law for the Federal Republic of Germany. (Don't worry, I checked their Terms of Service. They offer an XML version for machine processing, but this page serves as an example of processing HTML. So it should be fine.)

First things first: I create a file urls.txt holding all the URLs I want to download:

https://www.gesetze-im-internet.de/gg/art_1.html
https://www.gesetze-im-internet.de/gg/art_2.html
https://www.gesetze-im-internet.de/gg/art_3.html
https://www.gesetze-im-internet.de/gg/art_4.html
https://www.gesetze-im-internet.de/gg/art_5.html
https://www.gesetze-im-internet.de/gg/art_6.html
https://www.gesetze-im-internet.de/gg/art_7.html
https://www.gesetze-im-internet.de/gg/art_8.html
https://www.gesetze-im-internet.de/gg/art_9.html
https://www.gesetze-im-internet.de/gg/art_10.html
https://www.gesetze-im-internet.de/gg/art_11.html
https://www.gesetze-im-internet.de/gg/art_12.html
https://www.gesetze-im-internet.de/gg/art_12a.html
https://www.gesetze-im-internet.de/gg/art_13.html
https://www.gesetze-im-internet.de/gg/art_14.html
https://www.gesetze-im-internet.de/gg/art_15.html
https://www.gesetze-im-internet.de/gg/art_16.html
https://www.gesetze-im-internet.de/gg/art_16a.html
https://www.gesetze-im-internet.de/gg/art_17.html
https://www.gesetze-im-internet.de/gg/art_17a.html
https://www.gesetze-im-internet.de/gg/art_18.html
https://www.gesetze-im-internet.de/gg/art_19.html

urls.txt

Next, I write a bit of Python code in a file called scraper.py to download the HTML of these files.

In a real scenario, saving every page as a separate file like this would be too expensive, and you'd use a database instead. To keep things simple, I'll download the files into the same directory as the script and use the last part of each URL as the filename.
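Deriving the filename works because PurePath treats the URL string like a path and hands back its last component. A quick illustration (not part of the script):

from pathlib import PurePath

# The last path component of the URL becomes the local filename.
print(PurePath('https://www.gesetze-im-internet.de/gg/art_1.html').name)
# -> art_1.html

The full scraper.py then looks like this: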

from os import path
from pathlib import PurePath

import requests

with open('urls.txt', 'r') as fh:
    urls = fh.readlines()
urls = [url.strip() for url in urls]  # strip `\n`

for url in urls:
    file_name = PurePath(url).name
    file_path = path.join('.', file_name)
    text = ''

    try:
        response = requests.get(url)
        if response.ok:
            text = response.text
    except requests.exceptions.ConnectionError as exc:
        print(exc)

    with open(file_path, 'w') as fh:
        fh.write(text)

    print('Written to', file_path)

scraper.py

By downloading the files, I can process them locally as much as I want without being dependent on a server. Try to be a good web citizen, okay?
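Part of being a good web citizen is identifying yourself and not hammering the server. The original script doesn't do this, but a sketch of how you could add it looks like this (the User-Agent string and contact address are made up):

import time

import requests

# Hypothetical User-Agent; replace the contact address with your own.
HEADERS = {'User-Agent': 'gg-scraper-tutorial (contact: you@example.com)'}

with open('urls.txt', 'r') as fh:
    urls = [url.strip() for url in fh.readlines()]

for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # pause between requests so the server isn't stressed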

Step 2: Parse the source

Now that I've downloaded the files, it's time to extract their interesting features. To do that, I go to one of the pages I downloaded, open it in a web browser, and hit Ctrl-U to view its source. Inspecting it shows me the HTML structure.

In my case, I figured I want the text of the law without any markup. The element wrapping it has an id of container. Using BeautifulSoup I can see that a combination of find and get_text will do what I want.
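To confirm that combination before changing the script, a quick interactive check (my addition) on one of the downloaded files could look like this:

from bs4 import BeautifulSoup

with open('art_1.html', 'r') as fh:
    soup = BeautifulSoup(fh.read(), 'html.parser')

# The law text sits inside the element with id="container".
container = soup.find(id='container')
print(container.get_text()[:200])  # first 200 characters, markup stripped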

Since I have a second step now, I'm going to refactor the code a bit by putting it into functions and adding a minimal CLI.

from os import path
from pathlib import PurePath
import sys

from bs4 import BeautifulSoup
import requests


def download_urls(urls, dir):
    paths = []

    for url in urls:
        file_name = PurePath(url).name
        file_path = path.join(dir, file_name)
        text = ''

        try:
            response = requests.get(url)
            if response.ok:
                text = response.text
            else:
                print('Bad response for', url, response.status_code)
        except requests.exceptions.ConnectionError as exc:
            print(exc)

        with open(file_path, 'w') as fh:
            fh.write(text)

        paths.append(file_path)

    return paths

def parse_html(path):
    with open(path, 'r') as fh:
        content = fh.read()

    return BeautifulSoup(content, 'html.parser')

def download(urls):
    return download_urls(urls, '.')

def extract(path):
    return parse_html(path)

def transform(soup):
    container = soup.find(id='container')
    if container is not None:
        return container.get_text()

def load(key, value):
    d = {}
    d[key] = value
    return d

def run_single(path):
    soup = extract(path)
    content = transform(soup)
    unserialised = load(path, content.strip() if content is not None else '')
    return unserialised

def run_everything():
    l = []

    with open('urls.txt', 'r') as fh:
        urls = fh.readlines()
    urls = [url.strip() for url in urls]

    paths = download(urls)
    for path in paths:
        print('Written to', path)
        l.append(run_single(path))

    print(l)

if __name__ == "__main__":
    args = sys.argv

    if len(args) == 1:
        run_everything()
    else:
        if args[1] == 'download':
            download([args[2]])
            print('Done')
        if args[1] == 'parse':
            path = args[2]
            result = run_single(path)
            print(result)

scraper.py

Now I can run the code in three ways:

  1. Without any arguments to run everything (that is, download all URLs and extract them, then save to disk) via: python scraper.py

  2. With an argument of download and a URL to download: python scraper.py download https://www.gesetze-im-internet.de/gg/art_1.html. This will not process the file.

  3. With an argument of parse and a filepath to parse: python scraper.py parse art_1.html. This will skip the download step.

With that, there's one last thing missing.

Step 3: Format the source for further processing

Let's say I want to generate a word cloud for each article. This can be a quick way to get an idea about what a text is about. For this, install the package wordcloud and update the file like this:

from os import path
from pathlib import Path, PurePath
import sys

from bs4 import BeautifulSoup
import requests
from wordcloud import WordCloud

STOPWORDS_ADDENDUM = [
    'Das',
    'Der',
    'Die',
    'Diese',
    'Eine',
    'In',
    'InhaltsverzeichnisGrundgesetz',
    'im',
    'Jede',
    'Jeder',
    'Kein',
    'Sie',
    'Soweit',
    'Über'
]
STOPWORDS_FILE_PATH = 'stopwords.txt'
STOPWORDS_URL = 'https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt'


def download_urls(urls, dir):
    paths = []

    for url in urls:
        file_name = PurePath(url).name
        file_path = path.join(dir, file_name)
        text = ''

        try:
            response = requests.get(url)
            if response.ok:
                text = response.text
            else:
                print('Bad response for', url, response.status_code)
        except requests.exceptions.ConnectionError as exc:
            print(exc)

        with open(file_path, 'w') as fh:
            fh.write(text)

        paths.append(file_path)

    return paths

def parse_html(path):
    with open(path, 'r') as fh:
        content = fh.read()

    return BeautifulSoup(content, 'html.parser')

def download_stopwords():
    stopwords = ''

    try:
        response = requests.get(STOPWORDS_URL)
        if response.ok:
            stopwords = response.text
        else:
            print('Bad response for', STOPWORDS_URL, response.status_code)
    except requests.exceptions.ConnectionError as exc:
        print(exc)

    with open(STOPWORDS_FILE_PATH, 'w') as fh:
        fh.write(stopwords)

    return stopwords

def download(urls):
    return download_urls(urls, '.')

def extract(path):
    return parse_html(path)

def transform(soup):
    container = soup.find(id='container')
    if container is not None:
        return container.get_text()

def load(filename, text):
    if Path(STOPWORDS_FILE_PATH).exists():
        with open(STOPWORDS_FILE_PATH, 'r') as fh:
            stopwords = fh.readlines()
    else:
        stopwords = download_stopwords().splitlines()  # the download returns one string

    # Strip whitespace around
    stopwords = [stopword.strip() for stopword in stopwords]
    # Extend stopwords with own ones, which were determined after first run
    stopwords = stopwords + STOPWORDS_ADDENDUM

    try:
        cloud = WordCloud(stopwords=stopwords).generate(text)
        cloud.to_file(filename.replace('.html', '.png'))
    except ValueError:
        print('Could not generate word cloud for', filename)

def run_single(path):
    soup = extract(path)
    content = transform(soup)
    load(path, content.strip() if content is not None else '')

def run_everything():
    with open('urls.txt', 'r') as fh:
        urls = fh.readlines()
    urls = [url.strip() for url in urls]

    paths = download(urls)
    for path in paths:
        print('Written to', path)
        run_single(path)
    print('Done')

if __name__ == "__main__":
    args = sys.argv

    if len(args) == 1:
        run_everything()
    else:
        if args[1] == 'download':
            download([args[2]])
            print('Done')
        if args[1] == 'parse':
            path = args[2]
            run_single(path)
            print('Done')

scraper.py

What changed? For one, I downloaded a list of German stopwords from GitHub. This way, I can eliminate the most common words from the downloaded law text.

Then I instantiate a WordCloud instance with the list of stopwords I downloaded and the text of the law. It will be turned into an image with the same basename.

After the first run, I discover that the list of stopwords is incomplete. So I add additional words I want to exclude from the resulting image.

With that, the main part of web scraping is complete.

Bonus: What about SPAs?

SPAs - or Single Page Applications - are web applications where the whole experience is controlled by JavaScript, which is executed in the browser. As such, downloading the HTML file alone does not get us very far. What should we do instead?

We'll use the browser, with Selenium. Make sure to also install a driver (the code below uses Firefox, so that's geckodriver). Download the .tar.gz archive and unpack it in the bin folder of your virtual environment so it will be found by Selenium. That is the directory where you can find the activate script (on GNU/Linux systems).

As an example, I am using the Angular website here. Angular is a popular SPA framework written in JavaScript and, for the time being, guaranteed to be controlled by it.

Since the code will be slower, I create a new file called crawler.py for it. The content looks like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from wordcloud import WordCloud

def extract(url):
    elem = None
    driver = webdriver.Firefox()
    driver.get(url)

    try:
        found = WebDriverWait(driver, 10).until(
            EC.visibility_of(
                driver.find_element(By.TAG_NAME, "article")
            )
        )
        # Make a copy of relevant data, because Selenium will throw if
        # you try to access the properties after the driver quit
        elem = {
          "text": found.text
        }
    finally:
        driver.close()

    return elem

def transform(elem):
    return elem["text"]

def load(text, filepath):
    cloud = WordCloud().generate(text)
    cloud.to_file(filepath)

if __name__ == "__main__":
    url = "https://angular.io/"
    filepath = "angular.png"

    elem = extract(url)
    if elem is not None:
        text = transform(elem)
        load(text, filepath)
    else:
        print("Sorry, could not extract data")

crawler.py

Here, Python opens a Firefox instance, browses to the website, and looks for an <article> element. It copies the element's text into a dictionary, which is read out in the transform step and turned into a WordCloud during load.

When dealing with JavaScript-heavy sites, it is often useful to use Waits, and perhaps even to run execute_script to defer to JavaScript if needed.
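The crawler above only waits for the <article> element to become visible. A sketch of the two other tools just mentioned, an explicit wait for an element to be present and execute_script to run JavaScript in the page, could look like this (my addition, not part of crawler.py):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("https://angular.io/")
    # Wait up to 10 seconds for the element to appear in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "article"))
    )
    # Hand control to JavaScript, e.g. to scroll and trigger lazy loading.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
finally:
    driver.quit()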

Summary

Thanks for reading this far! Let's summarise what we've learned now:

  1. How to scrape a website with Python's requests package.

  2. How to translate it into a meaningful structure using beautifulsoup.

  3. How to further process that structure into something you can work with.

  4. What to do if the target page is relying on JavaScript.


Reference:
https://www.freecodecamp.org/news/webscraping-in-python/