12. Web Scraping
In those rare, terrifying moments when I'm without Wi-Fi, I realize just how much of what I do on the computer is really what I do on the Internet. Out of sheer habit I'll find myself trying to check email, read friends' Twitter feeds, or answer the question, "Did Kurtwood Smith have any major roles before he was in the original 1987 Robocop?"[2]
Since so much work on a computer involves going on the Internet, it'd be great if your programs could get online. Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. In this chapter, you will learn about several modules that make it easy to scrape web pages in Python.
webbrowser. Comes with Python and opens a browser to a specific page.
Requests. Downloads files and web pages from the Internet.
Beautiful Soup. Parses HTML, the format that web pages are written in.
Selenium. Launches and controls a web browser. Selenium is able to fill in forms and simulate mouse clicks in this browser.
Project: mapIt.py with the webbrowser Module
The webbrowser module's open() function can launch a new browser to a specified URL. Enter the following into the interactive shell:
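A minimal sketch of that call follows. The helper name open_page is hypothetical, chosen here only so that nothing launches until you call it; webbrowser.open() itself is the standard library function.

```python
import webbrowser

def open_page(url='http://inventwithpython.com/'):
    """Ask the default browser to show url; returns True if a browser was launched."""
    return webbrowser.open(url)
```

Calling open_page() in the interactive shell has the same effect as typing webbrowser.open('http://inventwithpython.com/') directly.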
A web browser tab will open to the URL http://inventwithpython.com/. This is about the only thing the webbrowser module can do. Even so, the open() function does make some interesting things possible. For example, it's tedious to copy a street address to the clipboard and bring up a map of it on Google Maps. You could take a few steps out of this task by writing a simple script to automatically launch the map in your browser using the contents of your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.
This is what your program does:
Gets a street address from the command line arguments or clipboard.
Opens the web browser to the Google Maps page for the address.
This means your code will need to do the following:
Read the command line arguments from sys.argv.
Read the clipboard contents.
Call the webbrowser.open() function to open the web browser.
Open a new file editor window and save it as mapIt.py.
Step 1: Figure Out the URL
Based on the instructions in Appendix B, set up mapIt.py so that when you run it from the command line, like so...
... the script will use the command line arguments instead of the clipboard. If there are no command line arguments, then the program will know to use the contents of the clipboard.
First you need to figure out what URL to use for a given street address. When you load http://maps.google.com/ in the browser and search for an address, the URL in the address bar looks something like this: https://www.google.com/maps/place/870+Valencia+St/@37.7590311,-122.4215096,17z/data=!3m1!4b1!4m2!3m1!1s0x808f7e3dadc07a37:0xc86b0b2bb93b73d8.
The address is in the URL, but there's a lot of additional text there as well. Websites often add extra data to URLs to help track visitors or customize sites. But if you try just going to https://www.google.com/maps/place/870+Valencia+St+San+Francisco+CA/, you'll find that it still brings up the correct page. So your program can be set to open a web browser to 'https://www.google.com/maps/place/your_address_string' (where your_address_string is the address you want to map).
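As a minimal sketch of that URL pattern, the helper below (build_maps_url is a hypothetical name) appends an address to the Google Maps prefix. Encoding spaces as + with urllib.parse.quote_plus() is an assumption made here to mirror the 870+Valencia+St form shown above; plain string concatenation is often enough for a browser.

```python
from urllib.parse import quote_plus

MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_maps_url(address):
    """Return the Google Maps URL for a street address string."""
    # quote_plus() turns spaces into '+' and escapes characters such as commas.
    return MAPS_PREFIX + quote_plus(address)
```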
Step 2: Handle the Command Line Arguments
Make your code look like this:
After the program's #! shebang line, you need to import the webbrowser module for launching the browser and import the sys module for reading the potential command line arguments. The sys.argv variable stores a list of the program's filename and command line arguments. If this list has more than just the filename in it, then len(sys.argv) evaluates to an integer greater than 1, meaning that command line arguments have indeed been provided.
Command line arguments are usually separated by spaces, but in this case, you want to interpret all of the arguments as a single string. Since sys.argv is a list of strings, you can pass it to the join() method, which returns a single string value. You don't want the program name in this string, so instead of sys.argv, you should pass sys.argv[1:] to chop off the first element of the list. The final string that this expression evaluates to is stored in the address variable.
If you run the program by entering this into the command line...
... the sys.argv variable will contain this list value:
The address variable will contain the string '870 Valencia St, San Francisco, CA 94110'.
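To make that concrete, here is a small sketch of the joining step. The function name address_from_args is hypothetical, and the sample list assumes the program was started as mapIt.py followed by the address, so the shell has already split it on spaces.

```python
def address_from_args(argv):
    """Join every command line argument after the program name into one string."""
    if len(argv) > 1:
        return ' '.join(argv[1:])
    return None  # no arguments were given

# What sys.argv would hold for: mapIt.py 870 Valencia St, San Francisco, CA 94110
sample_argv = ['mapIt.py', '870', 'Valencia', 'St,', 'San', 'Francisco,', 'CA', '94110']
address = address_from_args(sample_argv)
```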
Step 3: Handle the Clipboard Content and Launch the Browser
Make your code look like the following:
If there are no command line arguments, the program will assume the address is stored on the clipboard. You can get the clipboard content with pyperclip.paste() and store it in a variable named address. Finally, to launch a web browser with the Google Maps URL, call webbrowser.open().
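Putting the three steps together, one possible shape for mapIt.py is sketched below. It is not the book's listing: get_address and main are hypothetical names, the clipboard reader is passed in as a parameter so the fallback logic can be tried without a clipboard, and pyperclip (a third-party module) is imported only when main() actually runs.

```python
import sys
import webbrowser

def get_address(argv, read_clipboard):
    """Prefer command line arguments; otherwise fall back to the clipboard."""
    if len(argv) > 1:
        return ' '.join(argv[1:])
    return read_clipboard()

def main():
    import pyperclip  # third-party; install with: pip install pyperclip
    address = get_address(sys.argv, pyperclip.paste)
    webbrowser.open('https://www.google.com/maps/place/' + address)
```

Calling main() at the bottom of the script would make it behave as described in the steps above.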
While some of the programs you write will perform huge tasks that save you hours, it can be just as satisfying to use a program that conveniently saves you a few seconds each time you perform a common task, such as getting a map of an address. Table 11-1 compares the steps needed to display a map with and without mapIt.py.
Table 11-1. Getting a Map with and Without mapIt.py
Manually getting a map | Using mapIt.py |
Highlight the address. | Highlight the address. |
Copy the address. | Copy the address. |
Open the web browser. | Run mapIt.py. |
Go to http://maps.google.com/. | |
Click the address text field. | |
Paste the address. | |
Press ENTER. | |
See how mapIt.py makes this task less tedious?
Ideas for Similar Programs
As long as you have a URL, the webbrowser module lets users cut out the step of opening the browser and directing themselves to a website. Other programs could use this functionality to do the following:
Open all links on a page in separate browser tabs.
Open the browser to the URL for your local weather.
Open several social network sites that you regularly check.
Downloading Files from the Web with the requests Module
The requests module lets you easily download files from the Web without having to worry about complicated issues such as network errors, connection problems, and data compression. The requests module doesn't come with Python, so you'll have to install it first. From the command line, run pip install requests. (Appendix A has additional details on how to install third-party modules.)
The requests module was written because Python's urllib2 module is too complicated to use. In fact, take a permanent marker and black out this entire paragraph. Forget I ever mentioned urllib2. If you need to download things from the Web, just use the requests module.
Next, do a simple test to make sure the requests module installed itself correctly. Enter the following into the interactive shell:
If no error messages show up, then the requests module has been successfully installed.
Downloading a Web Page with the requests.get() Function
The requests.get() function takes a string of a URL to download. By calling type() on requests.get()'s return value, you can see that it returns a Response object, which contains the response that the web server gave for your request. I'll explain the Response object in more detail later, but for now, enter the following into the interactive shell while your computer is connected to the Internet:
The URL goes to a text web page for the entire play of Romeo and Juliet. You can tell that the request for this web page succeeded by checking the status_code attribute of the Response object. If it is equal to the value of requests.codes.ok, then everything went fine. (Incidentally, the status code for "OK" in the HTTP protocol is 200. You may already be familiar with the 404 status code for "Not Found.")
If the request succeeded, the downloaded web page is stored as a string in the Response object's text attribute. This attribute holds a large string of the entire play; the call to len(res.text) shows you that it is more than 178,000 characters long. Finally, calling print(res.text[:250]) displays only the first 250 characters.
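The checks just described can be sketched as follows. The helper names fetch_page and describe are hypothetical, the URL you pass in is up to you, and requests is imported inside fetch_page only so that describe can be tried on any object that has status_code and text attributes.

```python
def fetch_page(url):
    """Download url and return the Response object (needs network access)."""
    import requests  # third-party; install with: pip install requests
    return requests.get(url)

def describe(res):
    """Report success, total length, and the first 250 characters of a response."""
    return {
        'ok': res.status_code == 200,   # 200 is the value of requests.codes.ok
        'length': len(res.text),
        'preview': res.text[:250],
    }
```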
Checking for Errors
As you've seen, the Response object has a status_code attribute that can be checked against requests.codes.ok to see whether the download succeeded. A simpler way to check for success is to call the raise_for_status() method on the Response object. This will raise an exception if there was an error downloading the file and will do nothing if the download succeeded. Enter the following into the interactive shell:
The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs. This is a good thing: You want your program to stop as soon as some unexpected error happens. If a failed download isn't a deal breaker for your program, you can wrap the raise_for_status() line with try and except statements to handle this error case without crashing.
This raise_for_status() method call causes the program to output the following:
Always call raise_for_status() after calling requests.get(). You want to be sure that the download has actually worked before your program continues.
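A hedged sketch of the try and except wrapping mentioned above: download_succeeded is a hypothetical helper that accepts any object with a raise_for_status() method. It catches Exception broadly to stay short; with a real Response, the error raised for a 4xx or 5xx status is requests.exceptions.HTTPError, which you could name instead.

```python
def download_succeeded(res):
    """Return True if res.raise_for_status() passes, otherwise report the problem."""
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % exc)
        return False
    return True
```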
Saving Downloaded Files to the Hard Drive
From here, you can save the web page to a file on your hard drive with the standard open() function and write() method. There are some slight differences, though. First, you must open the file in write binary mode by passing the string 'wb' as the second argument to open(). Even if the page is in plaintext (such as the Romeo and Juliet text you downloaded earlier), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.
Unicode Encodings
Unicode encodings are beyond the scope of this book, but you can learn more about them from these web pages:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): http://www.joelonsoftware.com/articles/Unicode.html
Pragmatic Unicode: http://nedbatchelder.com/text/unipain.html
To write the web page to a file, you can use a for loop with the Response object's iter_content() method.
The iter_content() method returns "chunks" of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content().
The file RomeoAndJuliet.txt will now exist in the current working directory. Note that while the filename on the website was rj.txt, the file on your hard drive has a different filename. The requests module simply handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program. Even if you were to lose your Internet connection after downloading the web page, all the page data would still be on your computer.
The write() method returns the number of bytes written to the file. In the previous example, there were 100,000 bytes in the first chunk, and the remaining part of the file needed only 78,981 bytes.
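The loop can be sketched like this. save_response is a hypothetical helper; it works on anything that offers iter_content(), and it uses a with statement to close the file, which is a substitution for the explicit close() call the chapter describes.

```python
def save_response(res, filename, chunk_size=100000):
    """Write a response to filename in binary mode; return the total bytes written."""
    total = 0
    with open(filename, 'wb') as out_file:        # 'wb' = write binary
        for chunk in res.iter_content(chunk_size):
            total += out_file.write(chunk)        # write() returns the byte count
    return total
```

With a real download, res would be the value returned by requests.get(), checked first with raise_for_status().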
To review, here's the complete process for downloading and saving a file:
Call requests.get() to download the file.
Call open() with 'wb' to create a new file in write binary mode.
Loop over the Response object's iter_content() method.
Call write() on each iteration to write the content to the file.
Call close() to close the file.
That's all there is to the requests module! The for loop and iter_content() stuff may seem complicated compared to the open()/write()/close() workflow you've been using to write text files, but it's to ensure that the requests module doesn't eat up too much memory even if you download massive files. You can learn about the requests module's other features from http://requests.readthedocs.org/.
HTML
Before you pick apart web pages, you'll learn some HTML basics. You'll also see how to access your web browser's powerful developer tools, which will make scraping information from the Web much easier.
Resources for Learning HTML
Hypertext Markup Language (HTML) is the format that web pages are written in. This chapter assumes you have some basic experience with HTML, but if you need a beginner tutorial, I suggest one of the following sites:
A Quick Refresher
In case it's been a while since you've looked at any HTML, here's a quick overview of the basics. An HTML file is a plaintext file with the .html file extension. The text in these files is surrounded by tags, which are words enclosed in angle brackets. The tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags. For example, the following HTML will display Hello world! in the browser, with Hello in bold:
<strong>Hello</strong> world!
This HTML will look like Figure 11-1 in a browser.
The opening <strong> tag says that the enclosed text will appear in bold. The closing </strong> tag tells the browser where the end of the bold text is.
There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the <a> tag encloses text that should be a link. The URL that the text links to is determined by the href attribute. Here's an example:
This HTML will look like Figure 11-2 in a browser.
Figure 11-2. The link rendered in the browser
Some elements have an id attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element's id attribute using the browser's developer tools is a common task in writing web scraping programs.
Viewing the Source HTML of a Web Page
You'll need to look at the HTML source of the web pages that your programs will work with. To do this, right-click (or CTRL-click on OS X) any web page in your web browser, and select View Source or View page source to see the HTML text of the page (see Figure 11-3). This is the text your browser actually receives. The browser knows how to display, or render, the web page from this HTML.
Figure 11-3. Viewing the source of a web page
I highly recommend viewing the source HTML of some of your favorite sites. It's fine if you don't fully understand what you are seeing when you look at the source. You won't need HTML mastery to write simple web scraping programs. After all, you won't be writing your own websites. You just need enough knowledge to pick out data from an existing site.
Opening Your Browser's Developer Tools
Figure 11-4. The Developer Tools window in the Chrome browser
In Firefox, you can bring up the Web Developer Tools Inspector by pressing CTRL-SHIFT-C on Windows and Linux or by pressing ⌘-OPTION-C on OS X. The layout is almost identical to Chrome's developer tools.
After enabling or installing the developer tools in your browser, you can right-click any part of the web page and select Inspect Element from the context menu to bring up the HTML responsible for that part of the page. This will be helpful when you begin to parse HTML for your web scraping programs.
Don't Use Regular Expressions to Parse HTML
Locating a specific piece of HTML in a string seems like a perfect case for regular expressions. However, I advise you against it. There are many different ways that HTML can be formatted and still be considered valid HTML, but trying to capture all these possible variations in a regular expression can be tedious and error prone. A module developed specifically for parsing HTML, such as Beautiful Soup, will be less likely to result in bugs.
You can find an extended argument for why you shouldn't parse HTML with regular expressions at http://stackoverflow.com/a/1732454/1893164/.
Using the Developer Tools to Find HTML Elements
Once your program has downloaded a web page using the requests module, you will have the page's HTML content as a single string value. Now you need to figure out which part of the HTML corresponds to the information on the web page you're interested in.
This is where the browser's developer tools can help. Say you want to write a program to pull weather forecast data from http://weather.gov/. Before writing any code, do a little research. If you visit the site and search for the 94105 ZIP code, the site will take you to a page showing the forecast for that area.
What if you're interested in scraping the temperature information for that ZIP code? Right-click where it is on the page (or CONTROL-click on OS X) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page. Figure 11-5 shows the developer tools open to the HTML of the temperature.
Figure 11-5. Inspecting the element that holds the temperature text with the developer tools
From the developer tools, you can see that the HTML responsible for the temperature part of the web page is <p class="myforecast-current-lrg">59°F</p>. This is exactly what you were looking for! It seems that the temperature information is contained inside a <p> element with the myforecast-current-lrg class. Now that you know what you're looking for, the BeautifulSoup module will help you find it in the string.
Parsing HTML with the BeautifulSoup Module
Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The BeautifulSoup module's name is bs4 (for Beautiful Soup, version 4). To install it, you will need to run pip install beautifulsoup4 from the command line. (Check out Appendix A for instructions on installing third-party modules.) While beautifulsoup4 is the name used for installation, to import Beautiful Soup you run import bs4.
For this chapter, the Beautiful Soup examples will parse (that is, analyze and identify the parts of) an HTML file on the hard drive. Open a new file editor window in IDLE, enter the following, and save it as example.html. Alternatively, download it from http://nostarch.com/automatestuff/.
As you can see, even a simple HTML file involves many different tags and attributes, and matters quickly get confusing with complex websites. Thankfully, Beautiful Soup makes working with HTML much easier.
Creating a BeautifulSoup Object from HTML
The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns a BeautifulSoup object. Enter the following into the interactive shell while your computer is connected to the Internet:
This code uses requests.get() to download the main page from the No Starch Press website and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named noStarchSoup.
You can also load an HTML file from your hard drive by passing a File object to bs4.BeautifulSoup(). Enter the following into the interactive shell (make sure the example.html file is in the working directory):
Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.
Finding an Element with the select() Method
You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.
A full discussion of CSS selector syntax is beyond the scope of this book (there's a good selector tutorial in the resources at http://nostarch.com/automatestuff/), but here's a short introduction to selectors. Table 11-2 shows examples of the most common CSS selector patterns.
Table 11-2. Examples of CSS Selectors
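Selector passed to the select() method | Will match... |
soup.select('div') | All elements named <div> |
soup.select('#author') | The element with an id attribute of author |
soup.select('.notice') | All elements that use a CSS class attribute named notice |
soup.select('div span') | All elements named <span> that are within an element named <div> |
soup.select('div > span') | All elements named <span> that are directly within an element named <div>, with no other element in between |
soup.select('input[name]') | All elements named <input> that have a name attribute with any value |
soup.select('input[type="button"]') | All elements named <input> that have an attribute named type with value button |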
The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object's HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary. Using the example.html file from earlier, enter the following into the interactive shell:
This code will pull the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element's text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Al Sweigart'.
Passing the element to str() returns a string with the starting and closing tags and the element's text. Finally, attrs gives us a dictionary with the element's attribute, 'id', and the value of the id attribute, 'author'.
You can also pull all the <p> elements from the BeautifulSoup object. Enter this into the interactive shell:
This time, select() gives us a list of three matches, which we store in pElems. Using str() on pElems[0], pElems[1], and pElems[2] shows you each element as a string, and using getText() on each element shows you its text.
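The select() calls described above can be sketched as follows. EXAMPLE_HTML is an assumed stand-in for example.html, written only to agree with the facts stated in the text (three <p> elements and one element with id="author"), and summarize_matches is a hypothetical helper. bs4 is imported inside the function and the built-in 'html.parser' is named explicitly; both are choices made for this sketch.

```python
# Stand-in for example.html: an assumption with three <p> elements and one id="author".
EXAMPLE_HTML = """<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from my website.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>"""

def summarize_matches(html, selector):
    """Return (HTML string, text, attributes) for every element matching selector."""
    import bs4  # third-party; install with: pip install beautifulsoup4
    soup = bs4.BeautifulSoup(html, 'html.parser')
    return [(str(elem), elem.getText(), elem.attrs) for elem in soup.select(selector)]
```

Here summarize_matches(EXAMPLE_HTML, '#author') yields a single entry whose text is 'Al Sweigart', while the selector 'p' yields three entries.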
Getting Data from an Element's Attributes
The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute's value. Using example.html, enter the following into the interactive shell:
Here we use select() to find any <span> elements and then store the first matched element in spanElem. Passing the attribute name 'id' to get() returns the attribute's value, 'author'.
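A short sketch of that lookup: attribute_of_first is a hypothetical helper, and the one-line HTML string in the usage note is an assumed fragment rather than the full example.html.

```python
def attribute_of_first(html, selector, name):
    """Return attribute `name` of the first element matching selector, or None."""
    import bs4  # third-party; install with: pip install beautifulsoup4
    soup = bs4.BeautifulSoup(html, 'html.parser')
    matches = soup.select(selector)
    if not matches:
        return None
    # Tag.get() returns None when the element has no such attribute.
    return matches[0].get(name)
```

For instance, attribute_of_first('<p>By <span id="author">Al Sweigart</span></p>', 'span', 'id') gives 'author', and asking the same element for an attribute it lacks gives None.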
Project: "I'm Feeling Lucky" Google Search
Whenever I search a topic on Google, I don't look at just one search result at a time. By middle-clicking a search result link (or clicking while holding CTRL), I open the first several links in a bunch of new tabs to read later. I search Google often enough that this workflow (opening my browser, searching for a topic, and middle-clicking several links one by one) is tedious. It would be nice if I could simply type a search term on the command line and have my computer automatically open a browser with all the top search results in new tabs. Let's write a script to do this.
This is what your program does:
Gets search keywords from the command line arguments.
Retrieves the search results page.
Opens a browser tab for each result.
This means your code will need to do the following:
Read the command line arguments from sys.argv.
Fetch the search result page with the requests module.
Find the links to each search result.
Call the webbrowser.open() function to open the web browser.
Open a new file editor window and save it as lucky.py.
Step 1: Get the Command Line Arguments and Request the Search Page
Before coding anything, you first need to know the URL of the search result page. By looking at the browser's address bar after doing a Google search, you can see that the result page has a URL like https://www.google.com/search?q=SEARCH_TERM_HERE. The requests module can download this page and then you can use Beautiful Soup to find the search result links in the HTML. Finally, you'll use the webbrowser module to open those links in browser tabs.
Make your code look like the following:
The user will specify the search terms using command line arguments when they launch the program. These arguments will be stored as strings in a list in sys.argv.
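As a sketch of how those arguments could become the search URL, the hypothetical helper search_url below joins them with spaces and lets urllib.parse.urlencode() build the q= query string. Using urlencode() is an assumption of this sketch, not something the chapter prescribes.

```python
from urllib.parse import urlencode

def search_url(argv):
    """Build the Google results URL from the arguments after the program name."""
    terms = ' '.join(argv[1:])
    return 'https://www.google.com/search?' + urlencode({'q': terms})
```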
Step 2: Find All the Results
Now you need to use Beautiful Soup to extract the top search result links from your downloaded HTML. But how do you figure out the right selector for the job? For example, you can't just search for all <a> tags, because there are lots of links you don't care about in the HTML. Instead, you must inspect the search result page with the browser's developer tools to try to find a selector that will pick out only the links you want.
After doing a Google search for Beautiful Soup, you can open the browser's developer tools and inspect some of the link elements on the page. They look incredibly complicated, something like this: <a href="/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CCgQFjAA&url=http%3A%2F%2Fwww.crummy.com%2Fsoftware%2FBeautifulSoup%2F&ei=LHBVU_XDD9KVyAShmYDwCw&usg=AFQjCNHAxwplurFOBqg5cehWQEVKi-TuLQ&sig2=sdZu6WVlBlVSDrwhtworMA" onmousedown="return rwt(this,'','','','1','AFQjCNHAxwplurFOBqg5cehWQEVKi-TuLQ','sdZu6WVlBlVSDrwhtworMA','0CCgQFjAA','','',event)" data-href="http://www.crummy.com/software/BeautifulSoup/"><em>Beautiful Soup</em>: We called him Tortoise because he taught us.</a>.
It doesn't matter that the element looks incredibly complicated. You just need to find the pattern that all the search result links have. But this <a> element doesn't have anything that easily distinguishes it from the nonsearch result <a> elements on the page.
Make your code look like the following:
If you look up a little from the <a> element, though, there is an element like this: <h3 class="r">. Looking through the rest of the HTML source, it looks like the r class is used only for search result links. You don't have to know what the CSS class r is or what it does. You're just going to use it as a marker for the <a> element you are looking for. You can create a BeautifulSoup object from the downloaded page's HTML text and then use the selector '.r a' to find all <a> elements that are within an element that has the r CSS class.
Step 3: Open Web Browsers for Each Result
Finally, we'll tell the program to open web browser tabs for our results. Add the following to the end of your program:
By default, you open the first five search results in new tabs using the webbrowser module. However, the user may have searched for something that turned up fewer than five results. The soup.select() call returns a list of all the elements that matched your '.r a' selector, so the number of tabs you want to open is either 5 or the length of this list (whichever is smaller).
The built-in Python function min() returns the smallest of the integer or float arguments it is passed. (There is also a built-in max() function that returns the largest argument it is passed.) You can use min() to find out whether there are fewer than five links in the list and store the number of links to open in a variable named numOpen. Then you can run through a for loop by calling range(numOpen).
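The min() and range() logic can be sketched on its own, without a browser. links_to_open is a hypothetical helper, and the list it receives stands in for whatever soup.select() returned.

```python
def links_to_open(link_elems, limit=5):
    """Return at most `limit` items from the front of link_elems."""
    num_open = min(limit, len(link_elems))   # 5, or fewer if the list is short
    return [link_elems[i] for i in range(num_open)]
```

A list of seven results gives back the first five; a list of two gives back both.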
On each iteration of the loop, you use webbrowser.open() to open a new tab in the web browser. Note that the href attribute's value in the returned <a> elements does not have the initial http://google.com part, so you have to concatenate that to the href attribute's string value.
Now you can instantly open the first five Google results for, say, Python programming tutorials by running lucky python programming tutorials on the command line! (See Appendix B for how to easily run programs on your operating system.)
Ideas for Similar Programs
The benefit of tabbed browsing is that you can easily open links in new tabs to peruse later. A program that automatically opens several links at once can be a nice shortcut to do the following:
Open all the product pages after searching a shopping site such as Amazon
Open all the links to reviews for a single product
Open the result links to photos after performing a search on a photo site such as Flickr or Imgur
Project: Downloading All XKCD Comics
Blogs and other regularly updating websites usually have a front page with the most recent post as well as a Previous button on the page that takes you to the previous post. Then that post will also have a Previous button, and so on, creating a trail from the most recent page to the first post on the site. If you wanted a copy of the site's content to read when you're not online, you could manually navigate over every page and save each one. But this is pretty boring work, so let's write a program to do it instead.
XKCD is a popular geek webcomic with a website that fits this structure (see Figure 11-6). The front page at http://xkcd.com/ has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes.
Here's what your program does:
Loads the XKCD home page.
Saves the comic image on that page.
Follows the Previous Comic link.
Repeats until it reaches the first comic.
Figure 11-6. XKCD, "a webcomic of romance, sarcasm, math, and language"
This means your code will need to do the following:
Download pages with the requests module.
Find the URL of the comic image for a page using Beautiful Soup.
Download and save the comic image to the hard drive with iter_content().
Find the URL of the Previous Comic link, and repeat.
Open a new file editor window and save it as downloadXkcd.py.
Step 1: Design the Program
If you open the browser's developer tools and inspect the elements on the page, you'll find the following:
The URL of the comic's image file is given by the src attribute of an <img> element.
The <img> element is inside a <div id="comic"> element.
The Prev button has a rel HTML attribute with the value prev.
The first comic's Prev button links to the http://xkcd.com/# URL, indicating that there are no more previous pages.
Make your code look like the following:
You'll have a url variable that starts with the value 'http://xkcd.com' and repeatedly update it (in a while loop) with the URL of the current page's Prev link. At every step in the loop, you'll download the comic at url. You'll know to end the loop when url ends with '#'.
You will download the image files to a folder in the current working directory named xkcd. The call to os.makedirs() ensures that this folder exists, and the exist_ok=True keyword argument prevents the function from throwing an exception if this folder already exists. The rest of the code is just comments that outline the rest of your program.
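The outline described above might look like the sketch below. walk_back is a hypothetical name, and get_prev_url is an assumed callable that returns the Prev link for a page; passing it in keeps the loop free of any real downloading, and the TODO comment marks where the later steps would go.

```python
import os

def walk_back(start_url, get_prev_url, folder='xkcd'):
    """Visit pages from start_url backwards until a URL ending in '#' is reached."""
    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    url = start_url
    visited = []
    while not url.endswith('#'):
        visited.append(url)
        # TODO: download the page, then find and save the comic image.
        url = get_prev_url(url)          # follow the Prev link
    return visited
```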
Step 2: Download the Web Page
Let's implement the code for downloading the page. Make your code look like the following:
First, print url so that the user knows which URL the program is about to download; then use the requests module's requests.get() function to download it. As always, you immediately call the Response object's raise_for_status() method to throw an exception and end the program if something went wrong with the download. Otherwise, you create a BeautifulSoup object from the text of the downloaded page.
Step 3: Find and Download the Comic Image
Make your code look like the following:
From inspecting the XKCD home page with your developer tools, you know that the <img> element for the comic image is inside a <div> element with the id attribute set to comic, so the selector '#comic img' will get you the correct <img> element from the BeautifulSoup object.
A few XKCD pages have special content that isn't a simple image file. That's fine; you'll just skip those. If your selector doesn't find any elements, then soup.select('#comic img') will return a blank list. When that happens, the program can just print an error message and move on without downloading the image.
Otherwise, the selector will return a list containing one <img> element. You can get the src attribute from this <img> element and pass it to requests.get() to download the comic's image file.
Step 4: Save the Image and Find the Previous Comic
Make your code look like the following:
At this point, the image file of the comic is stored in the res variable. You need to write this image data to a file on the hard drive.
You'll need a filename for the local image file to pass to open(). The comicUrl will have a value like 'http://imgs.xkcd.com/comics/heartbleed_explanation.png', which you might have noticed looks a lot like a file path. And in fact, you can call os.path.basename() with comicUrl, and it will return just the last part of the URL, 'heartbleed_explanation.png'. You can use this as the filename when saving the image to your hard drive. You join this name with the name of your xkcd folder using os.path.join() so that your program uses backslashes (\) on Windows and forward slashes (/) on OS X and Linux. Now that you finally have the filename, you can call open() to open a new file in 'wb' ("write binary") mode.
Remember from earlier in this chapter that to save files you've downloaded using Requests, you need to loop over the return value of the iter_content() method. The code in the for loop writes out chunks of the image data (at most 100,000 bytes each) to the file, and then you close the file. The image is now saved to your hard drive.
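The filename handling and the chunked write can be tried without any network access. In this sketch, the chunks list is a made-up stand-in for what res.iter_content(100000) would yield, and a temporary directory stands in for the folder that holds the xkcd directory; neither comes from the program itself.

```python
import os
import tempfile

comicUrl = 'http://imgs.xkcd.com/comics/heartbleed_explanation.png'
filename = os.path.basename(comicUrl)   # last part of the URL

# Made-up byte chunks standing in for res.iter_content(100000).
chunks = [b'first chunk, ', b'second chunk']

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, 'xkcd')
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)   # right separator for this OS

    # 'wb' opens the file for writing binary data; the with statement
    # closes the file once every chunk has been written.
    with open(path, 'wb') as imageFile:
        for chunk in chunks:
            imageFile.write(chunk)

    # Read the file back to confirm what was written.
    with open(path, 'rb') as saved:
        saved_bytes = saved.read()
```

Using a with statement here is a design choice for the sketch: it closes the file even if a write fails, which an explicit close() call at the end would not guarantee.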
Afterward, the selector 'a[rel="prev"]' identifies the <a> element with the rel attribute set to prev, and you can use this <a> element's href attribute to get the previous comic's URL, which gets stored in url. Then the while loop begins the entire download process again for this comic.
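How an href value becomes the next value of url can be sketched on its own. The helper name prev_url and the sample href values are assumptions for illustration; the sketch simply prefixes the site address to a site-relative href, which is one plausible way to build the full URL.

```python
SITE = 'http://xkcd.com'

def prev_url(href):
    """Turn the Prev link's href (such as '/1357/' or '#') into a full URL."""
    return SITE + href

# A typical comic links to the one before it with a site-relative path.
middle = prev_url('/1357/')

# The first comic's Prev link is just '#', which is the signal to stop.
last = prev_url('#')
keep_going = not last.endswith('#')
```

Plain concatenation keeps the trailing '#', so the loop's url.endswith('#') check still works on the result.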
The output of this program will look like this:
This project is a good example of a program that can automatically follow links in order to scrape large amounts of data from the Web. You can learn about Beautiful Soup's other features from its documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Ideas for Similar Programs
Downloading pages and following links are the basis of many web crawling programs. Similar programs could also do the following:
Back up an entire site by following all of its links.
Copy all the messages off a web forum.
Duplicate the catalog of items for sale on an online store.
The requests and BeautifulSoup modules are great as long as you can figure out the URL you need to pass to requests.get(). However, sometimes this isn't so easy to find. Or perhaps the website you want your program to navigate requires you to log in first. The selenium module will give your programs the power to perform such sophisticated tasks.
Controlling the Browser with the selenium Module
The selenium module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though there is a human user interacting with the page. Selenium allows you to interact with web pages in a much more advanced way than Requests and Beautiful Soup; but because it launches a web browser, it is a bit slower and hard to run in the background if, say, you just need to download some files from the Web.
Appendix A has more detailed steps on installing third-party modules.
Starting a Selenium-Controlled Browser
For these examples, you'll need the Firefox web browser. This will be the browser that you control. If you don't already have Firefox, you can download it for free from http://getfirefox.com/.
Importing the modules for Selenium is slightly tricky. Instead of import selenium, you need to run from selenium import webdriver. (The exact reason why the selenium module is set up this way is beyond the scope of this book.) After that, you can launch the Firefox browser with Selenium. Enter the following into the interactive shell:
You'll notice that when webdriver.Firefox() is called, the Firefox web browser starts up. Calling type() on the value webdriver.Firefox() reveals it's of the WebDriver data type. And calling browser.get('http://inventwithpython.com') directs the browser to http://inventwithpython.com/. Your browser should look something like Figure 11-7.
Figure 11-7. After calling webdriver.Firefox() and get() in IDLE, the Firefox browser appears.
Finding Elements on the Page
WebDriver objects have quite a few methods for finding elements on a page. They are divided into the find_element_* and find_elements_* methods. The find_element_* methods return a single WebElement object, representing the first element on the page that matches your query. The find_elements_* methods return a list of WebElement objects for every matching element on the page.
Table 11-3 shows several examples of find_element_* and find_elements_* methods being called on a WebDriver object that's stored in the variable browser.
Table 11-3. Selenium's WebDriver Methods for Finding Elements
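Method name | WebElement object/list returned |
browser.find_element_by_class_name(name), browser.find_elements_by_class_name(name) | Elements that use the CSS class name |
browser.find_element_by_css_selector(selector), browser.find_elements_by_css_selector(selector) | Elements that match the CSS selector |
browser.find_element_by_id(id), browser.find_elements_by_id(id) | Elements with a matching id attribute value |
browser.find_element_by_link_text(text), browser.find_elements_by_link_text(text) | <a> elements that completely match the text provided |
browser.find_element_by_partial_link_text(text), browser.find_elements_by_partial_link_text(text) | <a> elements that contain the text provided |
browser.find_element_by_name(name), browser.find_elements_by_name(name) | Elements with a matching name attribute value |
browser.find_element_by_tag_name(name), browser.find_elements_by_tag_name(name) | Elements with a matching tag name |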
Except for the *_by_tag_name() methods, the arguments to all the methods are case sensitive. If no element on the page matches what a find_element_* method is looking for, the selenium module raises a NoSuchElementException (the find_elements_* methods return an empty list instead). If you do not want this exception to crash your program, add try and except statements to your code.
Once you have the WebElement object, you can find out more about it by reading the attributes or calling the methods in Table 11-4.
Table 11-4. WebElement Attributes and Methods
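Attribute or method | Description |
tag_name | The tag name, such as 'a' for an <a> element |
get_attribute(name) | The value for the element's name attribute |
text | The text within the element, such as 'hello' in <span>hello</span> |
clear() | For text field or text area elements, clears the text typed into it |
is_displayed() | Returns True if the element is visible; otherwise returns False |
is_enabled() | For input elements, returns True if the element is enabled; otherwise returns False |
is_selected() | For checkbox or radio button elements, returns True if the element is selected; otherwise returns False |
location | A dictionary with keys 'x' and 'y' for the position of the element in the page |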
For example, open a new file editor and enter the following program:
Here we open Firefox and direct it to a URL. On this page, we try to find elements with the class name 'bookcover', and if such an element is found, we print its tag name using the tag_name attribute. If no such element was found, we print a different message.
This program will output the following:
We found an element with the class name 'bookcover' and the tag name 'img'.
Clicking the Page
WebElement objects returned from the find_element_* and find_elements_* methods have a click() method that simulates a mouse click on that element. This method can be used to follow a link, make a selection on a radio button, click a Submit button, or trigger whatever else might happen when the element is clicked by the mouse. For example, enter the following into the interactive shell:
This opens Firefox to http://inventwithpython.com/, gets the WebElement object for the <a> element with the text Read It Online, and then simulates clicking that <a> element. It's just as if you had clicked the link yourself; the browser then follows that link.
Filling Out and Submitting Forms
Sending keystrokes to text fields on a web page is a matter of finding the <input> or <textarea> element for that text field and then calling the send_keys() method. For example, enter the following into the interactive shell:
As long as Gmail hasn't changed the id of the Username and Password text fields since this book was published, the previous code will fill in those text fields with the provided text. (You can always use the browser's inspector to verify the id.) Calling the submit() method on any element will have the same result as clicking the Submit button for the form that element is in. (You could have just as easily called emailElem.submit(), and the code would have done the same thing.)
Sending Special Keys
Selenium has a module for keyboard keys that are impossible to type into a string value, which function much like escape characters. These values are stored in attributes in the selenium.webdriver.common.keys module. Since that is such a long module name, it's much easier to run from selenium.webdriver.common.keys import Keys at the top of your program; if you do, then you can simply write Keys anywhere you'd normally have to write selenium.webdriver.common.keys. Table 11-5 lists the commonly used Keys variables.
Table 11-5. Commonly Used Variables in the selenium.webdriver.common.keys Module
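Attributes | Meanings |
Keys.DOWN, Keys.UP, Keys.LEFT, Keys.RIGHT | The keyboard arrow keys |
Keys.ENTER, Keys.RETURN | The ENTER and RETURN keys |
Keys.HOME, Keys.END, Keys.PAGE_DOWN, Keys.PAGE_UP | The HOME, END, PAGEDOWN, and PAGEUP keys |
Keys.ESCAPE, Keys.BACK_SPACE, Keys.DELETE | The ESC, BACKSPACE, and DELETE keys |
Keys.F1, Keys.F2, ..., Keys.F12 | The F1 to F12 keys at the top of the keyboard |
Keys.TAB | The TAB key |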
For example, if the cursor is not currently in a text field, pressing the HOME and END keys will scroll the browser to the top and bottom of the page, respectively. Enter the following into the interactive shell, and notice how the send_keys() calls scroll the page:
The <html> tag is the base tag in HTML files: The full content of the HTML file is enclosed within the <html> and </html> tags. The element returned by browser.find_element_by_tag_name('html') is a good place to send keys to the general web page. This would be useful if, for example, new content is loaded once you've scrolled to the bottom of the page.
Clicking Browser Buttons
Selenium can simulate clicks on various browser buttons as well through the following methods:
browser.back(). Clicks the Back button.
browser.forward(). Clicks the Forward button.
browser.refresh(). Clicks the Refresh/Reload button.
browser.quit(). Clicks the Close Window button.
More Information on Selenium
Selenium can do much more beyond the functions described here. It can modify your browser's cookies, take screenshots of web pages, and run custom JavaScript. To learn more about these features, you can visit the Selenium documentation at http://selenium-python.readthedocs.org/.
Summary
Most boring tasks aren't limited to the files on your computer. Being able to programmatically download web pages will extend your programs to the Internet. The requests module makes downloading straightforward, and with some basic knowledge of HTML concepts and selectors, you can use the BeautifulSoup module to parse the pages you download.
But to fully automate any web-based tasks, you need direct control of your web browser through the selenium module. The selenium module will allow you to log in to websites and fill out forms automatically. Since a web browser is the most common way to send and receive information over the Internet, this is a great ability to have in your programmer toolkit.
Practice Questions
Q: | 1. Briefly describe the differences between the |
Q: | 2. What type of object is returned by |
Q: | 3. What Requests method checks that the download worked? |
Q: | 4. How can you get the HTTP status code of a Requests response? |
Q: | 5. How do you save a Requests response to a file? |
Q: | 6. What is the keyboard shortcut for opening a browser's developer tools? |
Q: | 7. How can you view (in the developer tools) the HTML of a specific element on a web page? |
Q: | 8. What is the CSS selector string that would find the element with an |
Q: | 9. What is the CSS selector string that would find the elements with a CSS class of |
Q: | 10. What is the CSS selector string that would find all the |
Q: | 11. What is the CSS selector string that would find the |
Q: | 12. Say you have a Beautiful Soup |
Q: | 13. How would you store all the attributes of a Beautiful Soup |
Q: | 14. Running |
Q: | 15. What's the difference between the |
Q: | 16. What methods do Selenium's |
Q: | 17. You could call |
Q: | 18. How can you simulate clicking a browser's Forward, Back, and Refresh buttons with Selenium? |
Practice Projects
For practice, write programs to do the following tasks.
Command Line Emailer
Write a program that takes an email address and string of text on the command line and then, using Selenium, logs into your email account and sends an email of the string to the provided address. (You might want to set up a separate email account for this program.)
This would be a nice way to add a notification feature to your programs. You could also write a similar program to send messages from a Facebook or Twitter account.
Image Site Downloader
Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images. You could write a program that works with any photo site that has a search feature.
2048
2048 is a simple game where you combine tiles by sliding them up, down, left, or right with the arrow keys. You can actually get a fairly high score by repeatedly sliding in an up, right, down, and left pattern over and over again. Write a program that will open the game at https://gabrielecirulli.github.io/2048/ and keep sending up, right, down, and left keystrokes to automatically play the game.
Link Verification
Write a program that, given the URL of a web page, will attempt to download every linked page on the page. The program should flag any pages that have a 404 "Not Found" status code and print them out as broken links.
Reference : https://automatetheboringstuff.com/chapter11/