Pandas - My Cheatsheet
Sometimes I get just really lost with all available commands and tricks one can make on pandas. This way, I really wanted a place to gather my tricks that I really don’t want to forget.
My Pandas Cheatsheet
How to list available columns on a DataFrame
df.columns.values
How to make multiple filters
df[(df.column > value1) & (df.column < value2)]
How to iterate over a Dataframe
for item, row in df.iterrows():
print row()
How to count the ocurrences of each unique values on a Series
df[column].value_counts()
# get indexes
df[column].value_counts().index.tolist()
# get values of occurrences
df[column].value_counts().values.tolist()
How to save a DataFrame by chunks
df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
df1 = df.iloc[0:2,:]
df2= df.iloc[2:,:]
df1.to_csv('./teste1.csv', index=False, header=False)
df2.to_csv('./teste1.csv', index=False, header=False, mode='a')
df_final = pd.read_csv('./teste1.csv')
df_final.head()
A groupby example
df_grouped = df.groupby(
by=['first_column', 'second_column']
)['third_column'].mean().reset_index(name='mean_values_grouped')
How to fill values on missing months
If you have a dataframe with 2 columns: year and month. But data is not available for all months, so you need to enter missing months on your dataframe with empty values on them.
# Original data with months not available
df1 = pd.DataFrame({
'month': [1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 1, 2, 3, 4,
5, 8, 9, 10, 11, 12],
'year': [2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012,
2012, 2012, 2012, 2012],
'qty': [5, 7, 3, 6, 7, 8, 3, 5, 7, 10, 12,
5, 7, 8, 1, 3, 5, 7, 8, 20]
})
# List of all months
df2 = pd.DataFrame({'month': list(range(1,13))})
Now we create an empty dataframe with all available years and months:
from itertools import product
years_months = pd.DataFrame(list(product(np.unique(df2.month), np.unique(df1.year))), columns=['month', 'year'])
Now we can just merge both dataframes with an outer join:
pd.merge(years_months, df1, how='outer')
How to filter column elements by multiple elements contained on a list
df[df['A'].isin([3, 6])]
How to change a Series type?
import pandas as pd
serie = pd.Series([1, 2, 3, 4])
series.astype(float)
How to apply a function to every item of my Serie?
import pandas as pd
serie = pd.Series(['a', 'b', 'b', 'a'])
series.apply(lambda x: 0 if x=='a' else 1)
How to prepare my DataFrame to apply get_dummies?
import pandas as pd
X = pd.read_csv(..)
categorical = ['x1', 'x2', 'x4'] # columns that have categorical features in your X
for cat in categorical:
X[cat] = X[cat].astype(object)
X_dummy = pd.get_dummies(X)
read_csv errors of encoding
Usually you can read a csv just by doing something like:
pd.read_csv('file.csv')
Sometimes, an encoding error appears. The first option is to pass ‘utf8’ as a value of
the parameter encoding
.
pd.read_csv('file.csv', encoding='utf8')
But there are some cases where this is not enough and the following error keeps appearing:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 4: invalid continuation byte
The only thing that could resolve this was:
pd.read_csv('file.csv', encoding='latin-1')
Sum values of all columns
df.sum(axis=1)
Use apply for multiple columns
def my_function(a, b):
return a + b
df.apply(lambda row: my_function(row['a'], row['b']), axis=1)