📉
Tutorials
  • Computer History
  • Function
    • Finance
      • Calculate
    • Manage Data
    • Date&Time
    • Strings and Character
  • Snippets
    • Web Application
      • Hugo
      • JavaScript
        • Stopwatch using JavaScript?
    • Note
    • Start Project
      • GitHub
      • GitLab
    • Python Programming
      • Strings and Character Data
      • List
      • Dictionaries
    • Data Science
      • Setting Option
      • Get Data
  • Link Center
    • Next Articles
    • Google
    • Excel VBA
    • Python
      • Notebook
    • WebApp
      • Vue.js
    • Finance
    • Project
      • Kids
        • Scratch
      • Finance
        • Plotly.js
        • Portfolio
      • Mini Lab
        • Systems Administration
        • Auto Adjust Image
      • Sending Emails
      • ECS
        • Knowledge Base
        • ระบบผู้เชี่ยวชาญ (Expert System)
        • Check product
        • Compare two SQL databases
      • e-Library
        • Knowledge base
        • การจัดหมวดหมู่ห้องสมุด
        • Temp
      • AppSheet
        • บัญชีรายรับรายจ่าย
      • Weather App
      • COVID-19
  • Tutorials
    • Data Science
      • Data Science IPython notebooks
    • UX & UI
      • 7 กฎการออกแบบ UI
    • Web Scraping
      • Scrape Wikipedia Articles
      • Quick Start
    • GUI
      • pysimple
        • Create a GUI
      • Tkinter
        • Python Tkinter Tutorial
      • PyQt
        • PyQt Tutorial
    • MachineLearning
      • การพัฒนา Chat Bot
      • AI ผู้ช่วยใหม่ในการทำ Customer Segmentation
      • Customer Segmentation
      • ตัดคำภาษาไทย ด้วย PyThaiNLP API
    • Excel & VBA
      • INDEX กับ MATCH
      • รวมสูตร Excel ปี 2020
      • How to Write Code in a Spreadsheet
    • Visualization
      • Bokeh
        • Part I: Getting Started
        • Data visualization
        • Plotting a Line Graph
        • Panel Document
        • Interactive Data Visualization
    • VueJS
      • VueJS - Quick Guide
    • Django
      • Customize the Django Admin
      • พัฒนาเว็บด้วย Django
    • Git
      • วิธีสร้าง SSH Key
      • Git คืออะไร
      • เริ่มต้นใช้งาน Git
      • การใช้งาน Git และ Github
      • รวม 10 คำสั่ง Git
      • GIT Push and Pull
    • Finance
      • Stock Analysis using Pandas (Series)
      • Building Investment AI for fintech
      • Resampling Time Series
      • Python for Finance (Series)
      • Stock Data Analysis (Second Edition)
      • Get Stock Data Using Python
      • Stock Price Trend Analysis
      • Calculate Stock Returns
      • Quantitative Trading
      • Backtrader for Backtesting
      • Binance Python API
      • Pine Script (TradingView)
      • Stocks Analysis with Pandas and Scikit-Learn
      • Yahoo Finance API
      • Sentiment Analysis
      • yfinance Library
      • Stock Data Analysis
      • YAHOO_FIN
      • Algorithmic Trading
    • JavaScript
      • Split a number
      • Callback Function
      • The Best JavaScript Examples
      • File and FileReader
      • JavaScript Tutorial
      • Build Reusable HTML Components
      • Developing JavaScript components
      • JavaScript - Quick Guide
      • JavaScript Style Guide()
      • Beginner's Handbook
      • Date Now
    • Frontend
      • HTML
        • File Path
      • Static Site Generators.
        • Creating a New Theme
    • Flask
      • Flask - Quick Guide
      • Flask Dashboards
        • Black Dashboard
        • Light Blue
        • Flask Dashboard Argon
      • Create Flask App
        • Creating First Application
        • Rendering Pages Using Jinja
      • Jinja Templates
        • Primer on Jinja Templating
        • Jinja Template Document
      • Learning Flask
        • Ep.1 Your first Flask app
        • Ep.2 Flask application structure
        • Ep.3 Serving HTML files
        • Ep.4 Serving static files
        • Ep.5 Jinja template inheritance
        • Ep.6 Jinja template design
        • Ep.7 Working with forms in Flask
        • Ep.8 Generating dynamic URLs in Flask
        • Ep.9 Working with JSON data
        • Ep.23 Deploying Flask to a VM
        • Ep.24 Flask and Docker
        • Ep. 25: uWSGI Introduction
        • Ep. 26 Flask before and after request
        • Ep. 27 uWSGI Decorators
        • Ep. 28 uWSGI Decorators
        • Ep. 29 Flask MethodView
        • Ep. 30 Application factory pattern
      • The Flask Mega-Tutorial
        • Chapter 2: Templates
      • Building Flask Apps
      • Practical Flask tutorial series
      • Compiling SCSS to CSS
      • Flask application structure
    • Database
      • READING FROM DATABASES
      • SQLite
        • Data Management
        • Fast subsets of large datasets
      • Pickle Module
        • How to Persist Objects
      • Python SQL Libraries
        • Create Python apps using SQL Server
    • Python
      • Python vs JavaScript
      • Python Pillow – Adjust Image
      • Python Library for Google Search
      • Python 3 - Quick Guide
      • Regular Expressions
        • Python Regular Expressions
        • Regular Expression (RegEx)
        • Validate ZIP Codes
        • Regular Expression Tutorial
      • Python Turtle
      • Python Beginner's Handbook
      • From Beginner to Pro
      • Standard Library
      • Datetime Tutorial
        • Manipulate Times, Dates, and Time Spans
      • Work With a PDF
      • geeksforgeeks.org
        • Python Tutorial
      • Class
      • Modules
        • Modules List
        • pickle Module
      • Working With Files
        • Open, Read, Append, and Other File Handling
        • File Manipulation
        • Reading & Writing to text files
      • Virtual Environments
        • Virtual Environments made easy
        • Virtual Environmen
        • A Primer
        • for Beginners
      • Functions
        • Function Guide
        • Inner Functions
      • Learning Python
        • Pt. 4 Python Strings
        • Pt. 3 Python Variables
      • Zip Function
      • Iterators
      • Try and Except
        • Exceptions: Introduction
        • Exceptions Handling
        • try and excep
        • Errors and Exceptions
        • Errors & Exceptions
      • Control Flow
      • Lambda Functions
        • Lambda Expression คืออะไร
        • map() Function
      • Date and Time
        • Python datetime()
        • Get Current Date and Time
        • datetime in Python
      • Awesome Python
      • Dictionary
        • Dictionary Comprehension
        • ALL ABOUT DICTIONARIES
        • DefaultDict Type for Handling Missing Keys
        • The Definitive Guide
        • Why Functions Modify Lists and Dictionaries
      • Python Structures
      • Variable & Data Types
      • List
        • Lists Explained
        • List Comprehensions
          • Python List Comprehension
          • List Comprehensions in 5-minutes
          • List Comprehension
        • Python List
      • String
        • Strings and Character Data
        • Splitting, Concatenating, and Joining Strings
      • String Formatting
        • Improved String Formatting Syntax
        • String Formatting Best Practices
        • Remove Space
        • Add Spaces
      • Important basic syntax
      • List all the packages
      • comment
    • Pandas
      • Tutorial (GeeksforGeeks)
      • 10 minutes to pandas
      • Options and settings
      • เริ่มต้น Set Up Kaggle.com
      • Pandas - Quick Guide
      • Cookbook
      • NumPy
        • NumPy Package for Scientific
      • IO tools (text, CSV, …)
      • pandas.concat
      • Excel & Google Sheets
        • A Guide to Excel
        • Quickstart to the Google Sheets
        • Python Excel Tutorial: The Definitive Guide
      • Working With Text Data
        • Quickstart
      • API Reference
      • Groupby
      • DateTime Methods
      • DataFrame
      • sort_values()
      • Pundit: Accessing Data in DataFrames
      • datatable
        • DataFrame: to_json()
        • pydatatable
      • Read and Write Files
      • Data Analysis with Pandas
      • Pandas and Python: Top 10
      • 10 minutes to pandas
      • Getting Started with Pandas in Python
    • Markdown
      • Create Responsive HTML Emails
      • Using Markup Languages with Hugo
    • AngularJS
      • Learn AngularJS
    • CSS
      • The CSS Handbook
      • Box Shadow
      • Image Center
      • The CSS Handbook
      • The CSS Handbook
      • Loading Animation
      • CSS Grid Layout
      • Background Image Size
      • Flexbox
  • Series
    • จาวาสคริปต์เบื้องต้น
      • 1: รู้จักกับจาวาสคริปต์
  • Articles
    • Visualization
      • Dash
        • Introducing Dash
    • Finance
      • PyPortfolioOpt
      • Best Libraries for Finance
      • Detection of price support
      • Portfolio Optimization
      • Python Packages For Finance
    • Django
      • เริ่มต้น Django RestFramework
    • General
      • Heroku คืออะไร
      • How to Crack Passwords
    • Notebook
      • IPython Documentation
      • Importing Notebooks
      • Google Colab for Data Analytics
      • Creating Interactive Dashboards
      • The Definitive Guide
      • A gallery of interesting Jupyter Notebooks
      • Advanced Jupyter Notebooks
      • Converting HTML to Notebook
    • Pandas
      • Pandas_UI
      • Pandas Style API
      • Difference Between two Dataframes
      • 19 Essential Snippets in Pandas
      • Time Series Analysis
      • Selecting Columns in a DataFrame
      • Cleaning Up Currency Data
      • Combine Multiple Excel Worksheets
      • Stylin’ with Pandas
      • Pythonic Data Cleaning
      • Make Excel Faster
      • Reading Excel (xlsx) Files
      • How to use iloc and loc for Indexing
      • The Easiest Data Cleaning Method
    • Python
      • pip install package
      • Automating your daily tasks
      • Convert Speech to Text
      • Tutorial, Project Ideas, and Tips
      • Image Handling and Processing
        • Image Processing Part I
        • Image Processing Part II
        • Image tutorial
        • Image Processing with Numpy
        • Converts PIL Image to Numpy Array
      • Convert Dictionary To JSON
      • JSON Dump
      • Speech-to-Text Model
      • Convert Text to Speech
      • Tips & Tricks
        • Fundamentals for Data Science
        • Best Python Code Examples
        • Top 50 Tips & Tricks
        • 11 Beginner Tips
        • 10 Tips & Tricks
      • Password hashing
      • psutil
      • Lambda Expressions
    • Web Scraping
      • Web Scraping using Python
      • Build a Web Scraper
      • Web Scraping for beginner
      • Beautiful Soup
      • Scrape Websites
      • Python Web Scraping
        • Web Scraping Part 1
        • Web Scraping Part 2
        • Web Scraping Part 3
        • Web Scraping Part 4
      • Web Scraper
    • Frontend
      • Book Online with GitBook
      • Progressive Web App คืออะไร
      • self-host a Hugo web app
  • Examples
    • Django
      • Build a Portfolio App
      • SchoolManagement
    • Flask
      • Flask Stock Visualizer
      • Flask by Example
      • Building Flask Apps
      • Flask 101
    • OpenCV
      • Build a Celebrity Look-Alike
      • Face Detection-OpenCV
    • Python
      • Make Game FLASH CARD
      • Sending emails using Google
      • ตรวจหาภาพซ้ำด้วย Perceptual hashing
        • Sending Emails in Python
      • Deck of Cards
      • Extract Wikipedia Data
      • Convert Python File to EXE
      • Business Machine Learning
      • python-business-analytics
      • Simple Blackjack Game
      • Python Turtle Clock
      • Countdown
      • 3D Animation : Moon Phases
      • Defragmentation Algorithm
      • PDF File
        • จัดการข้อความ และรูป จากไฟล์ PDF ด้วย PDFBox
      • Reading and Generating QR codes
      • Generating Password
        • generate one-time password (OTP)
        • Random Password Generator
        • Generating Strong Password
      • PyQt: Building Calculator
      • List Files in a Directory
      • [Project] qID – โปรแกรมแต่งรูปง่ายๆ เพื่อการอัพลงเว็บ
      • Python and Google Docs to Build Books
      • Tools for Record Linking
      • Create Responsive HTML Email
      • psutil()
      • Transfer Learning for Deep Learning
      • ดึงข้อมูลคุณภาพอากาศประเทศไทย
        • Image Classification
    • Web Scraper
      • Scrape Wikipedia Articles
        • Untitled
      • How Scrape Websites with Python 3
    • Finance
      • Algorithmic Trading for Beginners
      • Parse TradingView Stock
      • Creating a stock price database with MariaDB and python
      • Source Code
        • stocks-list
      • Visualizing with D3
      • Real Time Stock in Excel using Python
      • Create Stock Quote Module
      • The Magic Formula Lost Its Sparkle?
      • Stock Market Analysis
      • Stock Portfolio Analyses Part 1
      • Stock Portfolio Analyses Part 2
      • Build A Dashboard In Python
      • Stock Market Predictions with LSTM
      • Trading example
      • Algorithmic Trading Strategies
      • DOWNLOAD FUNDAMENTALS DATA
      • Algorithmic Trading
      • numfin
      • Financial Machine Learning
      • Algorithm To Predict Stock Direction
      • Interactive Brokers API Code
      • The (Artificially) Intelligent Investor
      • Create Auto-Updating Excel of Stock Market
      • Stock Market Predictions
      • Automate Your Stock Portfolio
      • create an analytics dashboard
      • Bitcoin Price Notifications
      • Portfolio Management
    • WebApp
      • CSS
        • The Best CSS Examples
      • JavaScript
        • Memory Game
      • School Clock
      • Frontend Tutorials & Example
      • Side Menu Bar with sub-menu
      • Create Simple CPU Monitor App
      • Vue.js building a converter app
      • jQuery
        • The Best jQuery Examples
      • Image Slideshow
      • Handle Timezones
      • Text to Speech with Javascript
      • Building Blog for Your Portfolio
      • Responsive Website Layout
      • Maths Homework Generator
  • Books
    • Finance
      • Python for Finance (O'Reilly)
    • Website
      • Hugo
        • Go Bootcamp
        • Hugo in Action.
          • About this MEAP
          • Welcome
          • 1. The JAM stack with Hugo
          • 2. Live in 30 minutes
          • 3. Using Markup for content
          • 4. Content Management with Hugo
          • 5. Custom Pages and Customized Content
          • 6. Structuring web pages
          • A Appendix A.
          • B Appendix B.
          • C Appendix C.
    • Python
      • ภาษาไพธอนเบื้องต้น
      • Python Cheatsheet
        • Python Cheatsheet
      • Beginning Python
      • IPython Cookbook
      • The Quick Python Book
        • Case study
        • Part 1. Starting out
          • 1. About Python
          • 2. Getting started
          • 3. The Quick Python overview
        • Part 2. The essentials
          • 14. Exceptions
          • 13. Reading and writing files
          • 12. Using the filesystem
          • 11. Python programs
          • 10. Modules and scoping rules
          • 9. Functions
          • 8. Control flow
          • 4. The absolute basics
          • 5. Lists, tuples, and sets
          • 6. Strings
          • 7. Dictionaries
        • Part 3. Advanced language features
          • 19. Using Python libraries
          • 18. Packages
          • 17. Data types as objects
          • 16. Regular expressions
          • 15. Classes and OOP
        • Part 4. Working with data
          • Appendix B. Exercise answers
          • Appendix A. Python’s documentation
          • 24. Exploring data
          • 23. Saving data
          • 20. Basic file wrangling
          • 21. Processing data files
          • 22. Data over the network
      • The Hitchhiker’s Guide to Python
      • A Whirlwind Tour of Python
        • 9. Defining Functions
      • Automate the Boring Stuff
        • 4. Lists
        • 5. Dictionaries
        • 12. Web Scraping
        • 13. Excel
        • 14. Google Sheets
        • 15. PDF and Word
        • 16. CSV and JSON
    • IPython
    • Pandas
      • จัดการข้อมูลด้วย pandas เบื้องต้น
      • Pandas Tutorial
  • Link Center
    • Temp
  • เทควันโด
    • รวมเทคนิค
    • Help and Documentation
  • Image
    • Logistics
Powered by GitBook
On this page
  • Web Scraping With Python: Beautiful Soup
  • Conclusion

Was this helpful?

  1. Articles
  2. Web Scraping

Beautiful Soup

22 Apr 2020

PreviousWeb Scraping for beginnerNextScrape Websites

Last updated 5 years ago

Was this helpful?

Web Scraping With Python: Beautiful Soup

Learn what web scraping is and how it can be achieved with the help of Python's beautiful soup library. Learn by using Amazon website data.

If you are just getting started in Python and would like to learn more, take DataCamp's course.

Source

In the time when the internet is rich with so much data, and apparently, data has become the new oil, web scraping has become even more important and practical to use in various applications. Web scraping deals with extracting or scraping the information from the website. Web scraping is also sometimes referred to as web harvesting or web data extraction. Copying text from a website and pasting it to your local system is also web scraping. However, it is a manual task. Generally, web scraping deals with extracting data automatically with the help of web crawlers. Web crawlers are scripts that connect to the world wide web using the HTTP protocol and allows you to fetch data in an automated manner.

Whether you are a data scientist, engineer, or anybody who analyzes vast amounts of datasets, the ability to scrape data from the web is a useful skill to have. Let's say you find data from the web, and there is no direct way to download it, web scraping using Python is a skill you can use to extract the data into a useful form that can then be imported and used in various ways.

Some of the practical applications of web scraping could be:

  • Gathering resume of candidates with a specific skill,

  • Extracting tweets from twitter with specific hashtags,

  • Lead generation in marketing,

  • Scraping product details and reviews from e-commerce websites.

Apart from the above use-cases, web scraping is widely used in natural language processing for extracting text from the websites for training a deep learning model.

Potential Challenges of Web Scraping

  • One of the challenges you would come across while scraping information from websites is the various structures of websites. Meaning, the templates of websites will differ and will be unique; hence, generalizing across websites could be a challenge.

  • Another challenge could be longevity. Since the web developers keep updating their websites, you cannot certainly rely on one scraper for too long. Even though the modifications might be minor, but they still might create a hindrance for you while fetching the data.

Hence, to address the above challenges, there could be various possible solutions. One would be to follow continuous integration & development (CI/CD) and constant maintenance as the website modifications would be dynamic.

Another more realistic approach is to use Application Programming Interfaces (APIs) offered by various websites & platforms. For example, Facebook and twitter provide you API's specially designed for developers who want to experiment with their data or would like extract information to let's say related to all friends & mutual friends and draw a connection graph of it. The format of the data when using APIs is different from usual web scraping i.e., JSON or XML, while in standard web scraping, you mainly deal with data in HTML format.

What is Beautiful Soup?

Beautiful Soup is a pure Python library for extracting structured data from a website. It allows you to parse data from HTML and XML files. It acts as a helper module and interacts with HTML in a similar and better way as to how you would interact with a web page using other available developer tools.

  • It usually saves programmers hours or days of work since it works with your favorite parsers like lxml and html5lib to provide organic Python ways of navigating, searching, and modifying the parse tree.

  • Another powerful and useful feature of beautiful soup is its intelligence to convert the documents being fetched to Unicode and outgoing documents to UTF-8. As a developer, you do not have to take care of that unless the document intrinsic doesn't specify an encoding or Beautiful Soup is unable to detect one.

  • It is also considered to be faster when compared to other general parsing or scraping techniques.

Types of Parsers

Enough of theory, right? So, let's install beautiful soup and start learning about its features and capabilities using Python.

As a first step, you need to install the Beautiful Soup library using your terminal or jupyter lab. The best way to install beautiful soup is via pip, so make sure you have the pip module already installed.

!pip3 install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/site-packages (4.7.1)
Requirement already satisfied: soupsieve>=1.2 in /usr/local/lib/python3.7/site-packages (from beautifulsoup4) (1.9.5)

Importing necessary libraries

Let's import the required packages which you will use to scrape the data from the website and visualize it with the help of seaborn, matplotlib, and bokeh.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

Scraping the Amazon Best Selling Books

To connect to the URL and fetch the HTML content following things are required:

  • Define a get_data function which will input the page numbers as an argument,

  • Define a user-agent which will help in bypassing the detection as a scraper,

  • Specify the URL to requests.get and pass the user-agent header as an argument,

  • Extract the content from requests.get,

  • Scrape the specified page and assign it to soup variable,

Next and the important step is to identify the parent tag under which all the data you need will reside. The data that you are going to extract is:

  • Book Name

  • Author

  • Rating

  • Customers Rated

  • Price

The below image shows where the parent tag is located, and when you hover over it, all the required elements are highlighted.

Similar to the parent tag, you need to find the attributes for book name, author, rating, customers rated, and price. You will have to go to the webpage you would like to scrape, select the attribute and right-click on it, and select inspect element. This will help you in finding out the specific information fields you need an extract from the sheer HTML web page, as shown in the figure below:

Note that some author names are not registered with Amazon, so you need to apply extra find for those authors. In the below cell code, you would find nested if-else conditions for author names, which are to extract the author/publication names.

no_pages = 2

def get_data(pageNo):  
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    r = requests.get('https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo), headers=headers)#, proxies=proxies)
    content = r.content
    soup = BeautifulSoup(content)
    #print(soup)

    alls = []
    for d in soup.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'}):
        #print(d)
        name = d.find('span', attrs={'class':'zg-text-center-align'})
        n = name.find_all('img', alt=True)
        #print(n[0]['alt'])
        author = d.find('a', attrs={'class':'a-size-small a-link-child'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
        price = d.find('span', attrs={'class':'p13n-sc-price'})

        all1=[]

        if name is not None:
            #print(n[0]['alt'])
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is not None:
            #print(author.text)
            all1.append(author.text)
        elif author is None:
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
            if author is not None:
                all1.append(author.text)
            else:    
                all1.append('0')

        if rating is not None:
            #print(rating.text)
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            #print(price.text)
            all1.append(users_rated.text)
        else:
            all1.append('0')     

        if price is not None:
            #print(price.text)
            all1.append(price.text)
        else:
            all1.append('0')
        alls.append(all1)    
    return alls

The below code cell will perform the following functions:

  • Call the get_data function inside a for loop,

  • The for loop will iterate over this function starting from 1 till the number of pages+1.

  • Since the output will be a nested list, you would first flatten the list and then pass it to the DataFrame.

  • Finally, save the dataframe as a CSV file.

results = []
for i in range(1, no_pages+1):
    results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results),columns=['Book Name','Author','Rating','Customers_Rated', 'Price'])
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')

Reading CSV File

Now let's load the CSV file you created and save in the above cell. Again, this is an optional step; you could even use the dataframe df directly and ignore the below step.

df = pd.read_csv("amazon_products.csv")
df.shape
(100, 5)

The shape of the dataframe reveals that there are 100 rows and 5 columns in your CSV file.

Let's print the first 5 rows of the dataset.

df.head(61)

Book Name

Author

Rating

Customers_Rated

Price

0

The Power of your Subconscious Mind

Joseph Murphy

4.5 out of 5 stars

13,948

₹ 99.00

1

Think and Grow Rich

Napoleon Hill

4.5 out of 5 stars

16,670

₹ 99.00

2

Word Power Made Easy

Norman Lewis

4.4 out of 5 stars

10,708

₹ 130.00

3

Mathematics for Class 12 (Set of 2 Vol.) Exami...

R.D. Sharma

4.5 out of 5 stars

18

₹ 930.00

4

The Girl in Room 105

Chetan Bhagat

4.3 out of 5 stars

5,162

₹ 149.00

...

...

...

...

...

...

56

COMBO PACK OF Guide To JAIIB Legal Aspects Pri...

MEC MILLAN

4.5 out of 5 stars

114

₹ 1,400.00

57

Wren & Martin High School English Grammar and ...

Rao N

4.4 out of 5 stars

1,613

₹ 400.00

58

Objective General Knowledge

Sanjiv Kumar

4.2 out of 5 stars

742

₹ 254.00

59

The Rudest Book Ever

Shwetabh Gangwar

4.6 out of 5 stars

1,177

₹ 194.00

60

Sita: Warrior of Mithila (Ram Chandra Series -...

Amish Tripathi

4.4 out of 5 stars

3,110

₹ 248.00

61 rows × 5 columns

Let's do some preprocessing on the ratings, customers_rated, and price column.

  • Since you know the ratings are out of 5, you can keep only the rating and remove the extra part from it.

  • From the customers_rated column, remove the comma.

  • From the price column, remove the rupees symbol, comma, and split it by dot.

  • Finally, convert all the three columns into integer or float.

df['Rating'] = df['Rating'].apply(lambda x: x.split()[0])
df['Rating'] = pd.to_numeric(df['Rating'])
df["Price"] = df["Price"].str.replace('₹', '')
df["Price"] = df["Price"].str.replace(',', '')
df['Price'] = df['Price'].apply(lambda x: x.split('.')[0])
df['Price'] = df['Price'].astype(int)
df["Customers_Rated"] = df["Customers_Rated"].str.replace(',', '')
df['Customers_Rated'] = pd.to_numeric(df['Customers_Rated'], errors='ignore')
df.head()

Book Name

Author

Rating

Customers_Rated

Price

0

The Power of your Subconscious Mind

Joseph Murphy

4.5

13948

99

1

Think and Grow Rich

Napoleon Hill

4.5

16670

99

2

Word Power Made Easy

Norman Lewis

4.4

10708

130

3

Mathematics for Class 12 (Set of 2 Vol.) Exami...

R.D. Sharma

4.5

18

930

4

The Girl in Room 105

Chetan Bhagat

4.3

5162

149

Let's verify the data types of the DataFrame.

df.dtypes
Book Name           object
Author              object
Rating             float64
Customers_Rated      int64
Price                int64
dtype: object

Replace the zero values in the DataFrame to NaN.

df.replace(str(0), np.nan, inplace=True)
df.replace(0, np.nan, inplace=True)

Counting the Number of NaNs in the DataFrame

count_nan = len(df) - df.count()
count_nan
Book Name          0
Author             6
Rating             0
Customers_Rated    0
Price              1
dtype: int64

From the above output, you can observe that there is a total of six books that do not have an Author Name, while one book does not have a price associated with it. These pieces of information are crucial for an author who wants to sell his or her books and should not neglect to put such information.

Let's drop these NaNs.

df = df.dropna()

Authors Highest Priced Book

Let's find out which all authors had the highest-priced book. You will visualize the results for such the top 20 authors.

data = df.sort_values(["Price"], axis=0, ascending=False)[:15]
data

Book Name

Author

Rating

Customers_Rated

Price

56

COMBO PACK OF Guide To JAIIB Legal Aspects Pri...

MEC MILLAN

4.5

114

1400.0

98

Diseases of Ear, Nose and Throat

P L Dhingra

4.7

118

1285.0

3

Mathematics for Class 12 (Set of 2 Vol.) Exami...

R.D. Sharma

4.5

18

930.0

96

Madhymik Bhautik Vigyan -12 (Part 1-2) (NCERT ...

Kumar-Mittal

5.0

1

765.0

6

My First Library: Boxset of 10 Board Books for...

Wonder House Books

4.5

3116

750.0

38

Indian Polity - For Civil Services and Other S...

M. Laxmikanth

4.6

1210

700.0

42

A Modern Approach to Verbal & Non-Verbal Reaso...

R.S. Aggarwal

4.4

1822

675.0

27

The Intelligent Investor (English) Paperback –...

Benjamin Graham

4.4

6201

650.0

99

Law of CONTRACT & Specific Relief

Dr. Avtar Singh

4.4

23

643.0

49

All In One ENGLISH CORE CBSE Class 12 2019-20

Arihant Experts

4.4

493

599.0

72

The Secret

Rhonda Byrne

4.5

11220

556.0

86

How to Prepare for Quantitative Aptitude for t...

Arun Sharma

4.4

847

537.0

8

Quantitative Aptitude for Competitive Examinat...

R S Aggarwal

4.4

4553

435.0

16

Sapiens: A Brief History of Humankind

Yuval Noah Harari

4.6

14985

434.0

84

Concept of Physics Part-2 (2019-2020 Session) ...

H.C. Verma

4.6

1807

433.0

from bokeh.models import ColumnDataSource
from bokeh.transform import dodge
import math
from bokeh.io import curdoc
curdoc().clear()
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Legend
output_notebook()

Loading BokehJS ...

p = figure(x_range=data.iloc[:,1], plot_width=800, plot_height=550, title="Authors Highest Priced Book", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,4], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)

From the above graph, you can observe that the top two highest-priced books are by the author Mecmillan and P L Dhingra.

Top Rated Books and Authors wrt Customers Rated

Let's find out which authors have the top-rated books and which books of those authors are top rated. However, while finding this out, you would filter out those authors in which less than 1000 customers rated.

data = df[df['Customers_Rated'] > 1000]
data = data.sort_values(['Rating'],axis=0, ascending=False)[:15]
data

Book Name

Author

Rating

Customers_Rated

Price

26

Inner Engineering: A Yogi’s Guide to Joy

Sadhguru

4.7

4091

254.0

70

Bhagavad-Gita (Hindi)

A. C. Bhaktivedanta

4.7

1023

150.0

11

The Alchemist

Paulo Coelho

4.7

22182

264.0

47

Harry Potter and the Philosopher's Stone

J.K. Rowling

4.7

7737

234.0

84

Concept of Physics Part-2 (2019-2020 Session) ...

H.C. Verma

4.6

1807

433.0

16

Sapiens: A Brief History of Humankind

Yuval Noah Harari

4.6

14985

434.0

38

Indian Polity - For Civil Services and Other S...

M. Laxmikanth

4.6

1210

700.0

29

Wings of Fire: An Autobiography of Abdul Kalam

Arun Tiwari

4.6

3513

301.0

39

The Theory of Everything

Stephen Hawking

4.6

2004

199.0

25

The Immortals of Meluha (Shiva Trilogy)

Amish

4.6

4538

248.0

23

Life's Amazing Secrets: How to Find Balance an...

Gaur Gopal Das

4.6

3422

213.0

34

Dear Stranger, I Know How You Feel

Ashish Bagrecha

4.6

1130

167.0

17

The Monk Who Sold His Ferrari

Robin Sharma

4.6

5877

137.0

13

How to Win Friends and Influence People

Dale Carnegie

4.6

15377

99.0

59

The Rudest Book Ever

Shwetabh Gangwar

4.6

1177

194.0

p = figure(x_range=data.iloc[:,0], plot_width=800, plot_height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,0], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)

From the above output, you can observe that the top three rated books with more than 1000 customer ratings are Inner Engineering: A Yogi’s Guide to Joy, Bhagavad-Gita (Hindi), and The Alchemist.

p = figure(x_range=data.iloc[:,1], plot_width=800, plot_height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)

The above graph shows the top 10 authors in descending order who have the highest rated books with more than 1000 customer ratings, which are Sadhguru, A. C. Bhaktivedanta and Paulo Coelho.

Most Customer Rated Authors and Books

While you have already seen the top-rated books and top-rated authors, it would still be more convincing and credible to conclude the best author and the book based on the number of customers who rated for that book.

So, let's quickly find that out.

data = df.sort_values(["Customers_Rated"], axis=0, ascending=False)[:20]
data

Book Name

Author

Rating

Customers_Rated

Price

11

The Alchemist

Paulo Coelho

4.7

22182

264.0

1

Think and Grow Rich

Napoleon Hill

4.5

16670

99.0

13

How to Win Friends and Influence People

Dale Carnegie

4.6

15377

99.0

16

Sapiens: A Brief History of Humankind

Yuval Noah Harari

4.6

14985

434.0

18

Rich Dad Poor Dad : What The Rich Teach Their ...

Robert T. Kiyosaki

4.5

14591

296.0

10

The Subtle Art of Not Giving a F*ck

Mark Manson

4.4

14418

365.0

0

The Power of your Subconscious Mind

Joseph Murphy

4.5

13948

99.0

48

The Power of Your Subconscious Mind

Joseph Murphy

4.5

13948

99.0

72

The Secret

Rhonda Byrne

4.5

11220

556.0

41

1984

George Orwell

4.5

10829

95.0

2

Word Power Made Easy

Norman Lewis

4.4

10708

130.0

46

Man's Search For Meaning: The classic tribute ...

Viktor E Frankl

4.4

8544

245.0

67

The 7 Habits of Highly Effective People

R. Stephen Covey

4.3

8229

397.0

47

Harry Potter and the Philosopher's Stone

J.K. Rowling

4.7

7737

234.0

40

One Indian Girl

Chetan Bhagat

3.8

7128

113.0

65

Thinking, Fast and Slow (Penguin Press Non-Fic...

Daniel Kahneman

4.4

7087

410.0

27

The Intelligent Investor (English) Paperback –...

Benjamin Graham

4.4

6201

650.0

17

The Monk Who Sold His Ferrari

Robin Sharma

4.6

5877

137.0

53

Ram - Scion of Ikshvaku (Ram Chandra)

Amish Tripathi

4.2

5766

262.0

93

The Richest Man in Babylon

George S. Clason

4.5

5694

129.0

from bokeh.transform import factor_cmap
from bokeh.models import Legend
from bokeh.palettes import Dark2_5 as palette
import itertools
from bokeh.palettes import d3
#colors has a list of colors which can be used in plots
colors = itertools.cycle(palette)

palette = d3['Category20'][20]
index_cmap = factor_cmap('Author', palette=palette,
                         factors=data["Author"])
p = figure(plot_width=700, plot_height=700, title = "Top Authors: Rating vs. Customers Rated")
p.scatter('Rating','Customers_Rated',source=data,fill_alpha=0.6, fill_color=index_cmap,size=20,legend='Author')
p.xaxis.axis_label = 'RATING'
p.yaxis.axis_label = 'CUSTOMERS RATED'
p.legend.location = 'top_left'
BokehDeprecationWarning: 'legend' keyword is deprecated, use explicit 'legend_label', 'legend_field', or 'legend_group' keywords instead
show(p)

The above graph is a scatter plot of Authors who bagged customer rating vs. actual rating. The following conclusions can be made after looking at the above plot.

  • Hands down Paulo Coelho's book The Alchemist is the best selling book since the rating, and the number of customers rated are both in sync.

  • The author Amish Tripathi's book Ram - Scion of Ikshvaku (Ram Chandra) has a rating of 4.2 with a 5766 customer rating. However, the author George S. Clason's book The Richest Man in Babylon has almost similar customers rating, but the overall rating is 4.5. Hence, it can be concluded that more customers gave a high rating to The Richest Man in Babylon.

Conclusion

Congratulations on finishing the tutorial.

This tutorial was a basic introduction to web scraping with beautiful soup and how you can make sense out of the information extracted from the web by visualizing it using the bokeh plotting library. A good exercise to take a step forward in learning web scraping with beautiful soup is to scrape data from some other websites and see how you can get insights from it.

Source

Feel free to read more about it from .

This URL that you are going to scrape is the following: ) (If you can't access this link, here is the ). As you can see, the page argument can be modified to access data for each page. Hence, to access all the pages you will need to loop through all the pages to get the necessary dataset, but first, you need to find out the number of pages from the website.

If you are just getting started in Python and would like to learn more, take DataCamp's course.

Reference :

here
https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo
parent link
Introduction to Data Science in Python
https://www.datacamp.com/community/tutorials/amazon-web-scraping-using-beautifulsoup
Introduction to Data Science in Python