Working With Text Data

Pandas Working With Text Data

Lowercasing and Uppercasing Data

Series and Index objects are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods skip missing/NA values automatically, leaving them as missing in the result. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
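As a minimal sketch of the behaviour described above, the snippet below applies a .str method to a Series that contains a missing value and to an Index. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value
names = pd.Series(["Jai", None, "Gaurav"])

# .str.upper() is applied element by element; the missing entry stays missing
upper_names = names.str.upper()
print(upper_names)

# The same accessor is available on an Index, for example column labels
labels = pd.Index(["Name", "Age"])
lower_labels = labels.str.lower()
print(list(lower_labels))
```

No explicit null check is needed here: the missing entry simply passes through the string method.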

To lowercase text, use str.lower(), which converts all uppercase characters to lowercase; a string with no uppercase characters is returned unchanged. To uppercase text, use str.upper(), which converts all lowercase characters to uppercase; a string with no lowercase characters is returned unchanged. Code #1:

# Import pandas package 
import pandas as pd 
   
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
   
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
   
# converting and overwriting values in column 
df["Name"]= df["Name"].str.lower()
 
print(df)

This example uses the nba.csv file. Code #2:

# importing pandas package 
import pandas as pd 
   
# making data frame from csv file 
data = pd.read_csv("nba.csv") 
   
# converting and overwriting values in column 
data["Team"]= data["Team"].str.upper() 
   
# display 
data 

Output: All values in the Team column of the data frame have been converted to upper case.
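When nba.csv is not at hand, the same conversion can be tried on a small in-memory frame. The sketch below uses made-up team names in a column called Team; it is a stand-in for the file, not its actual contents.

```python
import pandas as pd

# Made-up stand-in for the Team column of nba.csv
teams = pd.DataFrame({"Team": ["Boston Celtics", "LA LAKERS", "76ers"]})

# Convert and overwrite the column; values that are already
# upper case, and characters such as digits, are left as they are
teams["Team"] = teams["Team"].str.upper()

print(teams)
```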

Splitting and Replacing Data

To split text, use str.split(), which breaks each string at the given separator and returns the pieces as a list. Python's built-in split() works on a single string only; the pandas version, called through the .str accessor, applies to a whole Series. Calling split() directly on a Series, without .str, raises an AttributeError. To replace text, use str.replace(), which works like Python's built-in replace() but is applied to every element of a Series; it is also called through .str. Without .str, Series.replace() is a different method that replaces whole values rather than substrings. Code #1:

# importing pandas module  
import pandas as pd 
     
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannauj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
    
# dropping rows with null values to avoid errors 
df.dropna(inplace = True) 
    
# splitting Address at the first "a" into two new columns 
# (the column names Address_1 and Address_2 are illustrative) 
df[["Address_1", "Address_2"]] = df["Address"].str.split("a", n = 1, expand = True) 
   
# df display 
print(df)
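The effect of the expand argument is easier to see side by side. The following is a small sketch on made-up addresses; the variable names are illustrative.

```python
import pandas as pd

addresses = pd.Series(["Nagpur", "Kanpur", "Allahabad"])

# Without expand, each element becomes a list of pieces
as_lists = addresses.str.split("a", n=1)
print(as_lists)

# With expand=True, the pieces are spread over the columns of a DataFrame
as_columns = addresses.str.split("a", n=1, expand=True)
print(as_columns)
```

Because expand=True yields a DataFrame, its result has to be assigned to as many columns as there are pieces, not to a single column.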

Code #2:

# importing pandas module 
import pandas as pd
 
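# reading the csv file
data = pd.read_csv("nba.csv")
 
# overwriting column with replaced value of age
# (Series.replace swaps whole values; str.replace works on substrings of string data)
data["Age"] = data["Age"].replace(25.0, "Twenty five")
 
# creating a boolean mask for the age column 
# where age == "Twenty five"
age_mask = data["Age"] == "Twenty five"
 
# displaying only the matching rows
# (dropna also discards rows with a missing value in any other column)
data.where(age_mask).dropna()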

Output: All the values in the Age column equal to 25.0 have been replaced by "Twenty five".
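The difference between str.replace() and Series.replace() can be sketched on a small made-up Series. The city names and replacement text below are illustrative only.

```python
import pandas as pd

cities = pd.Series(["Delhi", "New Delhi"])

# str.replace substitutes the substring inside every element;
# regex=False treats the pattern as literal text
substring = cities.str.replace("Delhi", "NCR", regex=False)
print(substring)

# Series.replace only swaps elements whose whole value matches
whole_value = cities.replace("Delhi", "NCR")
print(whole_value)
```

Here "New Delhi" is changed by str.replace() but left alone by Series.replace(), since it is not an exact match for "Delhi".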

Concatenation of Data

To concatenate strings in a Series or Index, use str.cat(), which joins the strings passed to it onto the strings of the calling Series, element by element. Values from a different Series can be passed; a plain list or array must have the same length as the caller, while a Series is matched to the caller by index. The method is called through the .str accessor. Code #1:

# importing pandas module 
import pandas as pd 
   
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannauj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
 
# making copy of address column 
new = df["Address"].copy() 
   
# concatenating address with name column 
# overwriting name column 
df["Name"]= df["Name"].str.cat(new, sep =", ") 
   
# display 
print(df)

Code #2:

# importing pandas module
import pandas as pd
 
# reading the csv file
data = pd.read_csv("nba.csv")
 
# making copy of team column
new = data["Team"].copy()
 
# concatenating team with name column
# overwriting name column
data["Name"]= data["Name"].str.cat(new, sep =", ")
 
# display
data

Output: Every string in the Name column has been concatenated with the string at the same index in the Team column, using the separator ", ".
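Missing values deserve a note when concatenating. The sketch below, on made-up data, shows the default result and the na_rep parameter of str.cat(), which supplies a placeholder for missing entries.

```python
import pandas as pd

names = pd.Series(["Jai", "Princi", "Gaurav"])
cities = pd.Series(["Nagpur", None, "Allahabad"])

# By default, a missing value on either side makes the result missing
default_join = names.str.cat(cities, sep=", ")
print(default_join)

# na_rep substitutes a placeholder for missing values before joining
filled_join = names.str.cat(cities, sep=", ", na_rep="unknown")
print(filled_join)
```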

Removing Whitespace from Data

To remove whitespace, use str.strip(), str.rstrip(), and str.lstrip(), which handle whitespace (including newlines) in text data. As the names suggest, str.lstrip() removes whitespace from the left side of each string, str.rstrip() removes it from the right side, and str.strip() removes it from both sides. Because these pandas methods share their names with Python's built-in string methods, they are called through the .str accessor so that they apply to every element of the Series. Code #1:

# importing pandas module 
import pandas as pd 
   
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur junction', 'Kanpur junction', 
                   'Nagpur junction', 'Kannauj junction'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data)
   
# replacing address name and adding spaces in start and end 
new = df["Address"].replace("Nagpur junction", "  Nagpur junction  ").copy() 
   
# comparing with padded strings: every result is False 
# because strip() has removed the spaces on both sides 
print(new.str.strip()==" Nagpur junction")
print(new.str.strip()=="Nagpur junction ")
print(new.str.strip()==" Nagpur junction ")

Code #2:

# importing pandas module 
import pandas as pd 
   
# making data frame 
data = pd.read_csv("nba.csv") 
   
# replacing team name and adding spaces in start and end 
new = data["Team"].replace("Boston Celtics", "  Boston Celtics  ").copy() 
   
# comparing with a string that keeps only the trailing spaces 
new.str.lstrip()=="Boston Celtics  "
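The three methods can be compared directly on a small made-up Series with leading spaces, trailing spaces, and a trailing newline. The variable names are illustrative.

```python
import pandas as pd

padded = pd.Series(["  Nagpur junction  ", "Kanpur junction\n"])

# lstrip removes whitespace on the left only
left = padded.str.lstrip()

# rstrip removes whitespace on the right only, including the newline
right = padded.str.rstrip()

# strip removes whitespace on both sides
both = padded.str.strip()

print(both)
```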

Reference : https://www.geeksforgeeks.org/python-pandas-working-with-text-data/
