Python Excel Tutorial: The Definitive Guide
Learn how to read and import Excel files in Python, and how to write data from Python back out to Excel spreadsheets.
If you are just getting started and would like to learn about working with data in Python, take DataCamp's interactive course, Importing Data in Python, to work with CSV and Excel files in Python.
Using Python And Excel For Data Science
Excel is a spreadsheet application developed by Microsoft and first released in 1985. It runs on most major platforms, including Windows, macOS, and Android, and it integrates easily with other tools. Microsoft Excel is among the best-known and most accessible applications for working with structured data.
It organizes, analyzes, and stores your data in a tabular row-column fashion. You can perform calculations and create pivot tables, graphs, and a lot more! Since its release, the software has gained popularity and is widely used across many application fields and domains around the world.
Since the internet was created, it has grown exponentially, and so has the amount of data. The growth in data has pushed the need for people who understand how to analyze it: corporations and governments began collecting big data, and the term data science was coined.
When working with data, you'll need to deal with spreadsheets at some point; however, working directly with spreadsheets can get annoying, especially when you are a developer. To solve this problem, Python developers came up with ways of reading, writing, and analyzing all kinds of file formats, including spreadsheets.
Today’s tutorial is mostly about how you can use the Python programming language to work with Excel without directly using the Microsoft Excel application. It provides hands-on experience with the packages that you can use to load, read, write, and analyze these spreadsheets with the help of Python. You will be dealing with packages like pandas, openpyxl, xlrd, xlutils, and pyexcel.
Data Is the New Oil
When you start any project that directly or indirectly deals with data, the first and foremost thing you would do is search for a dataset. Gathering data can be done in various ways: web scraping, a private dataset from a client, or a public dataset downloaded from sources like GitHub, universities, Kaggle, Quandl, etc.
This data might be in an Excel file or saved with a .csv, .txt, JSON, or other file extension. The data could be qualitative or quantitative, and the data type can vary depending on the kind of problem you plan to solve. Hence, as a first step, you should figure out whether you're working with qualitative or quantitative data.
The data could be:
Continuous
Discrete
Categorical - Binary, Unordered, Ordered
Pixels of images, etc.
Best Practices For Spreadsheet Data
Before you start loading, reading, and analyzing your Excel data in Python, it is good practice to view the sample data and check whether the following points are in line with the file you plan to work with:
The first row of the spreadsheet is usually reserved for the header, which describes what each column's data represents, unless the data in the spreadsheet is pixels of images.
Avoid field or header names with blank spaces, or names made up of multiple words with gaps between them. Consider instead a consistent naming convention such as:
Underscores,
Dashes, or
Camel case, where the first letter of each section of text is capitalized.
Prefer short names over long names or sentences, and try to avoid names that contain special characters such as ?, $, %, ^, etc., since special characters do not tell you anything about the data.
Your data might have missing values in some columns. Make sure to fill those with NA, or with the mean or median of the complete column.
While working with Microsoft Excel, you’ll find a considerable number of options to save your file. Besides the default extensions .xls or .xlsx, you can go to the “File” tab, click on “Save As”, and select one of the extensions listed under the “Save as Type” options. The most commonly used extensions to save datasets for data science are .csv and .txt (as a tab-delimited text file), and even .xml. Depending on the saving option that you choose, your dataset’s fields are separated by tabs or commas, which make up the “field separator characters” of your dataset.
Knowing the extension of your file is important: when you load the data stored in Excel, your Python library needs to know whether it is a comma-separated or tab-separated file.
Common file extensions that MS Excel supports include .xls, .xlsx, .xlsm, .xlsb, .xltx, .xltm, .csv, .txt, and .xml.
Preparing Your Workspace (Optional)
Preparing your workspace is a good step to have in your pipeline, but it is not mandatory and can be skipped. Having it as the first step, however, makes life simpler and ensures that you start off well.
Verify that your working directory is the same as your Python codebase directory.
When you're working in the terminal, you might first navigate to the directory that your file is located in and then start up Python. That also means you have to make sure your file is located in the directory that you want to work from!
But if you're a beginner who has already started a Python session and you've got no clue about the directory you're working in, you should consider executing the following commands:
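As a sketch, these working-directory checks can be done with Python's built-in os module (the commented-out path is a placeholder, not a real directory):

```python
import os

# Show the directory Python is currently working in
print(os.getcwd())

# List the files there to confirm your dataset is present
print(os.listdir("."))

# If needed, change to the folder that holds your data
# (replace the placeholder with your own path)
# os.chdir("/path/to/your/data")
```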
Another way is to keep track of where your dataset file is kept and give the absolute path to that folder in your code, instead of changing the directory to where you plan to write the Python code. The absolute path ensures that no matter where you run the Python code, it will be able to fetch the data for you!
You'll see that these commands are pretty vital, not only for loading your data but also for further analysis. For now, let's just continue. You have gone through all the checkups, you have saved your data, and prepared your workspace.
Install Packages to Read and Write Excel Files
Make sure you have pip and setuptools installed on your system. Don't use Python 2, as it has been discontinued; make sure you have Python 3 (version 3.4 or later) installed, in which case pip and setuptools normally come ready out of the box. If you already have Python 3, just make sure you have upgraded to the latest version.
If you do not have Python installed on your system, then feel free to check out this tutorial.
Check whether your pip or pip3 command is symbolically linked to Python 3, and use the one linked to the version of Python (>=3.4) you plan to use in this tutorial. Also check which version the python command reports in the terminal: if it shows 2.7, try python3; if that works, you have two different Python versions installed on your system.
To do this, run the following command in your terminal:
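For example, on a Unix-like shell (a sketch; the exact commands depend on your OS and installation):

```shell
# Which Python does the plain `python` command point to (if any)?
python --version || true

# Check the Python 3 interpreter and the pip linked to it
python3 --version
pip3 --version
```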
In case you haven’t installed pip yet, run the python get-pip.py script that you can find here. You can also follow the installation instructions on that page if you need more help getting everything running properly. You can also check this link to install pip in case the first link doesn't work for you.
Installing Anaconda is the way to go!
The Anaconda Python distribution is probably what you should be looking for, because it comes bundled with almost everything you need to start your data science journey: Python, pip, pandas, NumPy, Matplotlib, and more are all installed with it. This provides an easy and quick way to get started, because you won’t need to worry about installing the packages you need for data science separately. There will still be packages that Anaconda does not cover, which you can install manually via pip or build from source.
Anaconda is useful not just for beginners; it is also heavily used by seasoned developers. It's a way to quickly test out a proof of concept without having to install each package separately, which saves a lot of time.
Anaconda includes 100 of the most popular Python, R, and Scala packages for data science, as well as several open-source development environments, such as JupyterLab/Notebook and the Spyder IDE. If you’d like to start working with Jupyter Notebook after this tutorial, go to this page.
You can go here to install Anaconda. To learn how to install Anaconda, check out the documentation. Follow the instructions to install, and you’re ready to start!
Congratulations, your environment has been set up!
You are all set to start loading your files and analyzing them.
Load Excel Files As Pandas DataFrames
The pandas package is one of the best ways to import your dataset and represent it in a tabular row-column format. The pandas library is built on top of Numerical Python, popularly known as NumPy, and provides easy-to-use data structures and data analysis tools for the Python programming language. pandas has built-in functions that you can use to analyze and plot your data and make sense of it!
Because of the power and flexibility this library provides, it has become a first choice for many data scientists. Of course, the library has some disadvantages too: when dealing with big datasets with millions of records, loading, reading, and analyzing the data can be slow.
If you already have pandas available through Anaconda, you can load your Excel file into a pandas DataFrame with the pd.ExcelFile() function, as shown below:
Just create a dummy example.xlsx file, fill in some arbitrary values in rows and columns, and save it in the .xlsx format.
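Here is a sketch of that workflow; to keep it self-contained, the dummy example.xlsx is created with pandas itself (this assumes openpyxl is available as pandas' Excel engine):

```python
import pandas as pd

# Create a dummy example.xlsx with some arbitrary values
pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).to_excel(
    "example.xlsx", sheet_name="Sheet1", index=False
)

# Load the spreadsheet
xl = pd.ExcelFile("example.xlsx")

# Print the sheet names in the file
print(xl.sheet_names)

# Load one sheet into a DataFrame by its name
df = xl.parse("Sheet1")
print(df.head())
```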
If you didn’t install Anaconda, you might get a “no module” error. Just execute pip install pandas in a terminal, or !pip install pandas in a Jupyter Notebook cell, to install the pandas package in your environment, and then execute the commands included in the code chunk above.
It's so simple, right?
To read in .csv files, you have a similar function to load the data into a DataFrame: read_csv(). Here’s an example of how you can use this function:
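For instance (the file name and contents below are made up for the example):

```python
import pandas as pd

# Write a small comma-separated file so there is something to read
with open("data.csv", "w") as f:
    f.write("name,score\nAlice,90\nBob,85\n")

# read_csv() loads the file straight into a DataFrame
df = pd.read_csv("data.csv")
print(df)
```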
The pd.read_csv() function has a sep argument that specifies the delimiter the function should take into account, such as a comma or a tab. By default it is set to a comma, but you can specify an alternative delimiter if you want to. Go to the documentation to find out which other arguments you can specify to import your file and load it successfully.
How To Write Pandas DataFrames to Excel Files
Just as you load and read files in the .csv or .xlsx format with pandas, you can save pandas data frames either as an Excel file with an .xlsx extension or as a .csv file. Let’s say that, after data analysis and machine learning predictions, you want to write the updated data or results back to a new file. You can achieve that using the pandas to_excel() function.
But before you use this function, make sure that you have XlsxWriter installed if you want to write your data to multiple worksheets in a .xlsx file, as shown below:
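A minimal sketch (this assumes the XlsxWriter package is installed; the file and sheet names are arbitrary):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"C": [5, 6]})

# One ExcelWriter, several sheets in the same .xlsx file
with pd.ExcelWriter("multi.xlsx", engine="xlsxwriter") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")
```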
Let's break down the above code chunk and understand it step by step:
You first define the writer in which the data frame output will be saved, using an ExcelWriter object to output the DataFrame. The pd.ExcelWriter function takes two arguments: the filename and the engine, here xlsxwriter.
Next, you pass the writer variable to the to_excel() function, and you also specify the sheet name. This way, you add a sheet with the data to an existing workbook, which can have many worksheets: you can use the ExcelWriter to save multiple, different DataFrames to one workbook with multiple sheets.
A much simpler option is to write the data to a file with the .csv extension. Just as you saw above how you can read a .csv file using read_csv, you can also write the data frame results back to a comma-separated file using the pandas to_csv() method, as shown below:
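For example (the file name is chosen for the example):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

# Write the DataFrame to a comma-separated file;
# index=False leaves the row index out of the output
df.to_csv("results.csv", index=False)
```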
If you want to save the output in a tab-separated fashion, all you need to do is pass \t to the sep argument. Note that there are various other functions and ways that you can use to write your files; for example, you can also pass the header and index arguments to the to_csv function. You can find all of them here.
Using Conda Environment
The general advice for installing these packages is to do it in a Python or Anaconda virtualenv without system packages. The benefit of installing packages inside a virtual environment is that it doesn't upgrade or downgrade base system packages, and you can have different conda environments for different projects.
To start working with virtual environments, you first need to create one. This is very simple, especially with Anaconda: from your base environment, simply create a virtual environment with a name and the Python version you want it to use. Then activate it, install whatever packages you need, and go to your project folder.
Tip: don’t forget to deactivate the environment when you’re done!
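As a sketch, the conda workflow looks like this (the environment name excel and the Python version are just examples):

```shell
# Create a conda environment named "excel" with a chosen Python version
conda create -n excel python=3.8 -y

# Activate it and install what you need
conda activate excel
pip install pandas openpyxl

# ...work on your project...

# Deactivate the environment when you're done
conda deactivate
```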
Having virtual environments makes life very simple. Imagine that, as a developer, you are working on multiple projects, and each project needs a given package at a different version. When your projects have conflicting requirements, a virtual environment comes in handy!
Otherwise, you would keep installing a package, then upgrading it for one project and downgrading it for another. A much better idea is to have a separate environment for every project.
Now you can finally start installing and importing the packages that you have read about to load in your spreadsheet data.
How To Read and Write Excel Files with Openpyxl
The openpyxl package is recommended if you want to read and write .xlsx, .xlsm, .xltx, and .xltm file formats.
You can install openpyxl using pip, inside the excel conda environment, as shown in the code cell below.
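For instance (assuming the conda environment from the previous section is active):

```shell
pip install openpyxl
```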
You can read more about openpyxl here.
Now that you have installed openpyxl, you can start loading in the data.
But before you load in the data, you would need to create it.
The load_workbook() function takes the filename as an argument and returns a workbook object wb, which represents the file. You can check the type of wb by running type(wb).
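A self-contained sketch; the example.xlsx file is created in code first so the load step has something to work with (names and values are arbitrary):

```python
from openpyxl import Workbook, load_workbook

# Create the data first: a small workbook saved as example.xlsx
wb_new = Workbook()
ws = wb_new.active
ws.title = "Sheet1"
ws.append(["name", "score"])   # row 1: headers
ws.append(["Alice", 90])       # row 2
ws.append(["Bob", 85])         # row 3
wb_new.save("example.xlsx")

# Load the file back; load_workbook() returns a Workbook object
wb = load_workbook("example.xlsx")
print(type(wb))

# The sheet names of the workbook you loaded
print(wb.sheetnames)
```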
You see that the code chunk above returns the sheet names of the workbook that you loaded in Python. Next, you can use this information also to retrieve separate sheets of the workbook.
You can also check which sheet is currently active with wb.active. As you can see in the code below, you can also load another sheet from your workbook:
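For example (a workbook with two sheets is built first so there is something to switch between; the sheet names are arbitrary):

```python
from openpyxl import Workbook, load_workbook

# Build a two-sheet workbook to load back in
wb_new = Workbook()
wb_new.active.title = "Sheet1"
wb_new.create_sheet("Sheet2")
wb_new.save("two_sheets.xlsx")

wb = load_workbook("two_sheets.xlsx")
print(wb.active.title)   # the currently active sheet

# Load another sheet from the workbook by name
sheet = wb["Sheet2"]
print(sheet.title)
```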
Though you might think that these Worksheet objects are of no use at first, you can do a lot with them. For example, you can retrieve values from specific cells in your workbook's sheet by using square brackets [], to which you pass the exact cell whose value you want to retrieve.
This way of extracting values from a cell is quite similar in spirit to selecting and extracting values from NumPy arrays and pandas data frames via index positions. But with openpyxl, you need to specify the attribute .value in addition to the index from which you want to extract the value, as shown below:
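A sketch (the sheet contents are arbitrary sample values; B3 is the cell inspected below):

```python
from openpyxl import Workbook

# A small in-memory sheet; B3 holds the value we'll retrieve
wb = Workbook()
sheet = wb.active
sheet["A1"], sheet["B1"] = "name", "score"
sheet["A2"], sheet["B2"] = "Alice", 90
sheet["A3"], sheet["B3"] = "Bob", 85

# Square brackets select the cell; .value extracts what it holds
cell = sheet["B3"]
print(cell.value)       # 85
print(cell.row)         # 3
print(cell.column)      # 2
print(cell.coordinate)  # B3
```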
As you can see, besides the value attribute, there are other attributes that you can use to inspect your cell, like row, column, and coordinate.
From the above code cell output, when selecting the B3 element from sheet1:
The row attribute gives 3;
The column attribute gives 2; and
The coordinate of the cell gives B3.
This was information about the cells, now what if you want to retrieve the cell values?
You can retrieve cell values by using the sheet.cell() function. All you need to do is pass the row and column arguments and add the attribute .value at the end, as shown below:
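For example (the cell contents are arbitrary):

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet["B3"] = 85

# cell() takes row and column numbers; .value extracts the contents
value = sheet.cell(row=3, column=2).value
print(value)  # 85
```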
To extract values continuously, instead of manually selecting the row and column indices, you can use a for loop together with the range() function.
This gives you a lot of flexibility for extracting the cell values without much hardcoding. Let's print out the values of the rows that have values in column 2. If those particular cells are empty, you’ll just get back None.
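A sketch of that loop (the sample sheet deliberately leaves rows 3 and 4 of column 2 empty, so they come back as None):

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet["B1"], sheet["B2"] = "score", 90
# B3 and B4 are left empty on purpose

# Print the value in column 2 for each of the first four rows
for i in range(1, 5):
    print(i, sheet.cell(row=i, column=2).value)
```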
If you want to know more about for loops, consider taking our Intermediate Python for Data Science course.
openpyxl has a utils module with two handy functions: get_column_letter and column_index_from_string. As the names suggest, the former returns the letter for a given column number, and the latter returns the column number for a given letter.
You can see how it works below:
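For instance:

```python
from openpyxl.utils import get_column_letter, column_index_from_string

# Number -> letter
print(get_column_letter(2))           # B

# Letter -> number
print(column_index_from_string("B"))  # 2
```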
You have already retrieved values for rows with values in a particular column, but what do you need to do if you want to print out the rows of your file without just focusing on one column?
You use another for loop, of course!
Say, for example, that you want to focus on the area between A1 and C3, where the first coordinate specifies the upper-left corner and the second the lower-right corner of the area you want to focus on.
This area is represented by the cellObj variable that you see in the first line of code below. You then say that, for each cell that lies in that area, you print the coordinate and the value contained within that cell. After the end of each row, you print a message that signals that the row of the cellObj area has been printed.
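A self-contained sketch of that loop (the 3x3 area is filled with the values 1 through 9 for the example):

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
# Fill a 3x3 area with the values 1..9
for r in range(1, 4):
    for c in range(1, 4):
        sheet.cell(row=r, column=c, value=(r - 1) * 3 + c)

# "A1":"C3" selects the area between the upper-left and
# lower-right corners; each cellObj is one cell in that area
for row in sheet["A1":"C3"]:
    for cellObj in row:
        print(cellObj.coordinate, cellObj.value)
    print("--- END OF ROW ---")
```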
Note again how the selection of the area is very similar to selecting, getting, and indexing lists and NumPy array elements, where you also use square brackets and a colon : to indicate the area from which you want to get the values. In addition, the above loop also makes good use of the cell attributes!
To make the above explanation and code concrete: once the loop has finished, you get back one coordinate-value pair per cell, followed by an end-of-row message after each row.
Lastly, there are some attributes that you can use to check the result of your import, namely max_row and max_column. These attributes are, of course, just general ways of making sure that you loaded the data correctly, but nonetheless they can and will be useful.
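For example (on a small sheet built in code):

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.append(["name", "score"])
sheet.append(["Alice", 90])
sheet.append(["Bob", 85])

# Quick sanity check on the dimensions of the data
print(sheet.max_row)     # 3
print(sheet.max_column)  # 2
```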
Great. Until now, you saw how you can read data and retrieve it using openpyxl in Python. A lot of you might have felt that this is an awfully hard way to work with these files, and you haven't even looked at how to manipulate the data yet, which can be even more convoluted.
But don't worry, there is a much easier way!
DataFrames to the Rescue.
You can use the DataFrame() function from the pandas package to put the values of a sheet into a DataFrame, and then use all the data frame functions to analyze and manipulate your data:
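A minimal sketch (the sheet is built in code so the example is self-contained):

```python
import pandas as pd
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.append(["name", "score"])
sheet.append(["Alice", 90])

# sheet.values yields the rows; DataFrame() puts them in tabular form
df = pd.DataFrame(sheet.values)
print(df)
```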
If you want to specify headers and an index, you can pass a header argument with a list of headers and set index to True. However, since the sheet you converted to a data frame already has headers in its first row, you don't need to add separate headers:
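One common pattern, sketched here, is to use the sheet's own first row as the column names:

```python
import pandas as pd
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.append(["name", "score"])
sheet.append(["Alice", 90])
sheet.append(["Bob", 85])

# Take the first row off the generator and use it as the header
data = sheet.values
cols = next(data)
df = pd.DataFrame(data, columns=cols)
print(df)
```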
You can even append or write the values back to Excel files, as shown below, with the help of the dataframe_to_rows method: you pass the data frame df you created above, along with the index and header:
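A sketch of that round trip (the data frame here is a stand-in for the df built earlier):

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

wb = Workbook()
ws = wb.active

# Append the DataFrame to the sheet row by row, with index and header
for row in dataframe_to_rows(df, index=True, header=True):
    ws.append(row)

wb.save("pandas_openpyxl.xlsx")
```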
The openpyxl package offers you high flexibility in how you write your data back to Excel files: it allows you to change cell styles and much more, which makes it one of those packages you need to know when working with spreadsheets.
Note: to learn more about openpyxl, like how you can change cell styles or how the package works with NumPy and pandas, check this out.
Another package that is useful when dealing with spreadsheets is xlrd. Let's check it out!
Reading And Formatting Excel Files: xlrd
This package is ideal if you want to read and manipulate data from files with the .xls or .xlsx extension (note that recent versions of xlrd, 2.0 and later, only support the older .xls format).
xlrd provides functions that you can use to retrieve or filter only a specific sheet rather than the whole workbook, such as sheet_by_name() or sheet_by_index(), to retrieve the sheets you want to use in your analysis and filter out the rest.
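A sketch of that, assuming both xlwt and xlrd are installed (xlwt is used only to create a legacy .xls file to read back; the file name and values are arbitrary):

```python
import xlwt
import xlrd

# Create a small .xls file to read back
wb_out = xlwt.Workbook()
ws_out = wb_out.add_sheet("Sheet1")
ws_out.write(0, 0, "name")
ws_out.write(0, 1, "score")
ws_out.write(1, 0, "Alice")
ws_out.write(1, 1, 90)
wb_out.save("legacy.xls")

# Open the workbook and pull out just the sheet you need
book = xlrd.open_workbook("legacy.xls")
sheet = book.sheet_by_name("Sheet1")   # or: book.sheet_by_index(0)
print(sheet.nrows, sheet.ncols)
print(sheet.cell_value(1, 1))  # note: xlrd reads numbers as floats
```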
Writing Your Data to Excel Files with xlwt
Like other Excel Python packages, you can use xlwt to create spreadsheets that hold your data, even manually. You can use the xlwt package as an alternative to the XlsxWriter package; xlwt is ideal for writing data and format information to files with the older .xls extension.
At first glance, you will hardly find much difference from the previous Excel packages you learned about; it is more about how comfortable you feel working with this package compared to the others.
Let's understand it with the help of an example wherein you will manually create a workbook using Python code and write data to it:
Automating the Data Writing Process
Automating the process of writing data to an Excel file is essential, especially when you want to write data to the file but don't want to spend time manually entering it. In such scenarios, you can automate the whole pipeline using very simple techniques, like a for loop.
Let's understand how it can be achieved:
Let's break the above code and understand it step-by-step:
You first initialize a workbook using xlwt.Workbook();
Then you add a sheet to the workbook with the name Sheet1;
Then you define the data, i.e., the header (cols) and the rows (txt);
Next, you have a for loop that iterates over the data and fills all the values into the file: for every element ranging from 0 to 4, you fill in the values row by row, specifying a row element that moves to the next row at every loop increment;
Next, you have another for loop that, for each row, goes over all the columns of your sheet and fills in a value for every column in that row;
When you have filled all the columns of each row with values, you go to the next row until you have no rows left.
Using pyexcel To Read .xls or .xlsx Files
pyexcel is a Python wrapper that provides a single API for reading, manipulating, and writing data in .csv, .ods, .xls, .xlsx, and .xlsm files. With pyexcel, the data in Excel files can be turned into an array or dict format with minimal code.
Below is an example of how you can convert your Excel data into an array format using get_array(), a function within the pyexcel package:
Let's find out how you can convert your Excel data into an ordered dictionary of lists. To achieve this, you can use the get_dict() function, which also comes with the pyexcel package:
You can also get a dictionary of two-dimensional arrays. To put it simply, you can extract all the workbook sheets into a single dictionary with the help of the get_book_dict() function.
To make your life easier, remember that the two outputs above, my_dict and book_dict, can be converted to a DataFrame using pd.DataFrame(). This will make it easier for you to handle your data!
To learn how you can efficiently work with Python lists, check out DataCamp's 18 Most Common Python List Questions.
Writing Files With pyexcel
Just as it’s easy to load your data into arrays with this package, you can also easily export your arrays back to a spreadsheet. You can do this using the save_as() function: pass the array, and pass the destination file name to the dest_file_name argument, as shown below:
Note that if you want to specify a delimiter, you can add the dest_delimiter argument and pass the symbol that you want to use as a delimiter, such as a tab (\t) or a comma (,).
If, however, you have a dictionary, you’ll need to use the save_book_as() function. Pass the two-dimensional dictionary to bookdict and specify the file name, and you’re good to go:
Note that the order of your data in the dictionary will not be kept in the above code. If you don’t want this, you will need to make a small modification. You can read all about it here.
Reading and Writing .csv files
Python has an enormous number of packages for achieving similar tasks with different libraries. Hence, if you’re still looking for ways to load, read, and write data to .csv files besides the Excel packages and pandas, you can consider using Python's built-in csv module, as shown in the code cell below:
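For example (the file name and rows are arbitrary; note the csv module reads every field back as a string):

```python
import csv

# Write a couple of rows with the standard-library csv module
with open("people.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerow(["name", "score"])
    writer.writerow(["Alice", 90])

# Read them back
with open("people.csv", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```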
Final Check of the Data
When you have the data available, it is generally recommended to check whether it has been loaded correctly. If you have put your data in a DataFrame, you can easily and quickly check this by running the head() and tail() functions: head() outputs the first few rows of the data frame, while tail() outputs the last few rows.
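For instance (on a small stand-in DataFrame):

```python
import pandas as pd

data = pd.DataFrame({"name": ["Alice", "Bob", "Carol"],
                     "score": [90, 85, 88]})

# First and last rows as a quick sanity check on the load
print(data.head(2))
print(data.tail(1))
```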
Tip: make use of DataCamp’s Pandas Cheat Sheet; it will come in handy when you are loading files as pandas DataFrames. For more guidance on how to manipulate Python DataFrames, take our Pandas Tutorial: DataFrames in Python.
Let's check the shape, dimensions, and data types of the data frame data:
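A sketch of those checks (again on a small stand-in DataFrame):

```python
import pandas as pd

data = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

print(data.shape)   # (rows, columns)
print(data.ndim)    # number of dimensions (2 for a DataFrame)
print(data.dtypes)  # data type of each column
```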
If you want to know more about how you can leverage NumPy arrays for data analysis, consider going through our NumPy tutorial, and don’t forget to use our Golden NumPy cheat sheet!
Conclusion
Congratulations on finishing this tutorial!
You have successfully gone through our tutorial that taught you all about how to read Excel files in Python.
But importing data is just the start of your data science workflow. Once you have the data from your spreadsheets in your environment, you can focus on what matters: analyzing your data. If you have loaded your data in DataFrames, consider taking our Pandas Foundations or Manipulating DataFrames with pandas courses.
However, if you want to continue working on this topic, consider checking out PyXLL, which enables you to write functions in Python and call them in Excel.
Reference : https://www.datacamp.com/community/tutorials/python-excel-tutorial