As Python remains the language of choice for data analysis and data science, Pandas has become indispensable for handling, manipulating, and analyzing structured data efficiently. Recruiters must identify candidates skilled in data wrangling, aggregation, and transformation using Pandas to ensure accurate, high-quality insights.
This resource, "100+ Pandas Interview Questions and Answers," is tailored for recruiters to simplify the evaluation process. It covers topics from basic data manipulation to advanced data analysis and performance optimization.
Whether hiring for Data Analysts, Data Scientists, or Machine Learning Engineers, this guide enables you to assess a candidate’s:
Core Pandas Knowledge: Series and DataFrame, their attributes, methods, and indexing; .loc, .iloc, boolean indexing, and conditional selection.
Advanced Skills
Real-World Proficiency
For a streamlined assessment process, consider platforms like WeCP, which allow you to:
✅ Create customized Pandas assessments tailored to your data analysis and reporting needs.
✅ Include hands-on tasks, such as cleaning messy datasets, performing complex groupby operations, or generating pivot tables and insights.
✅ Proctor assessments remotely with AI-based monitoring to ensure integrity.
✅ Leverage automated grading to evaluate correctness, efficiency, and adherence to best practices in data manipulation.
Save time, ensure technical depth, and confidently hire Pandas professionals who can analyze and transform data accurately and efficiently from day one.
Pandas is an open-source data manipulation and analysis library built on top of NumPy. It provides powerful, flexible data structures that allow you to work with structured data more easily. While NumPy is primarily designed for numerical computing, Pandas adds support for labeled axes (rows and columns), making it easier to handle and analyze data in tabular form (e.g., spreadsheets or databases).
Key differences: NumPy arrays are homogeneous and indexed purely by position, while Pandas Series and DataFrames carry labeled row and column indexes, allow mixed data types across columns, and ship with built-in tools for handling missing data, grouping, and merging.
In short, while both are built on the same core principles, Pandas extends NumPy’s capabilities by adding more high-level data manipulation features suitable for practical data analysis.
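As a minimal sketch (the values and column names below are made up for illustration), the same data held as a NumPy array versus a Pandas DataFrame shows the labeled-axes difference:
import numpy as np
import pandas as pd
arr = np.array([[25, 85], [30, 90]])   # positional access only, e.g. arr[1, 1]
df = pd.DataFrame(arr, columns=['Age', 'Score'], index=['Alice', 'Bob'])
value = df.loc['Bob', 'Score']          # label-based access: 90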
A DataFrame is a 2-dimensional labeled data structure in Pandas, similar to a table or a spreadsheet, with rows and columns. It is one of the core data structures in Pandas and is used for storing and manipulating structured data.
A DataFrame can hold data of different types (e.g., integers, floats, strings) in each column, and each row is labeled with an index (often defaulting to integers). The columns are also labeled with headers. A DataFrame can be created from various data sources such as lists, dictionaries, NumPy arrays, and external data files like CSV or Excel.
Key features of a DataFrame: labeled rows (index) and labeled columns, heterogeneous column types, size mutability (columns and rows can be added or removed), and built-in handling of missing data and alignment.
Example of creating a simple DataFrame:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
Pandas provides two main data structures:
Series: a one-dimensional labeled array that can hold any data type. Example:
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
DataFrame: a two-dimensional labeled table made up of rows and columns. Example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
In addition to these, Pandas also supports other specialized data structures like Categorical (for handling categorical data efficiently) and DatetimeIndex (for time-based data).
Example of a Series:
import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
Example of a DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
So, the primary difference is dimensionality: a Series is one-dimensional (single column or row), while a DataFrame is two-dimensional (multiple rows and columns).
A Pandas DataFrame can be created in several ways. Some common methods include:
From a dictionary of lists or arrays:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
From a list of lists or tuples:
data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
From a NumPy array:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
From a CSV file:
df = pd.read_csv('data.csv')
From an existing Series:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df = pd.DataFrame(s)
These methods offer flexibility in how you structure and import data into a DataFrame.
A Pandas Series can be created in several ways:
From a list or array:
import pandas as pd
s = pd.Series([1, 2, 3, 4])
From a dictionary:
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
From a NumPy array:
import numpy as np
arr = np.array([1, 2, 3])
s = pd.Series(arr)
With a custom index:
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
The index of a Series is optional, but it can be set explicitly to assign meaningful labels to the data points.
The pd.read_csv() function is used to load data from a CSV (Comma Separated Values) file into a Pandas DataFrame. It is one of the most commonly used functions for reading tabular data into Pandas. This function automatically handles parsing and data type inference, making it easy to work with data from CSV files.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
You can also pass parameters to control how the CSV is read, such as specifying the delimiter (sep), handling missing values (na_values), and parsing dates (parse_dates).
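For instance, a brief sketch (the file name, separator, and column name here are placeholder assumptions, not taken from the example above):
# Semicolon-delimited file; treat 'NA' and '?' as missing; parse the 'date' column as datetime
df = pd.read_csv('data.csv', sep=';', na_values=['NA', '?'], parse_dates=['date'])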
To read data from an Excel file, you can use the pd.read_excel() function. This function allows you to load data from Excel files into a Pandas DataFrame. It supports both .xls and .xlsx file formats.
Example:
import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
You can also specify additional parameters like usecols (to specify which columns to load), skiprows (to skip rows at the start), and dtype (to specify data types for columns).
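As a small sketch (the sheet and column names below are illustrative assumptions):
# Load only two columns and force 'Age' to integer
df = pd.read_excel('data.xlsx', sheet_name='Sheet1', usecols=['Name', 'Age'], dtype={'Age': 'int64'})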
To check the first few rows of a DataFrame, you can use the head() method. By default, it returns the first 5 rows, but you can specify the number of rows to return by passing an argument.
Example:
df.head() # First 5 rows
df.head(10) # First 10 rows
This method is useful for quickly inspecting the content of a DataFrame to verify its structure and data.
To get basic statistical details of a DataFrame, such as mean, median, standard deviation, etc., you can use the describe() method. This function returns a summary of statistics for each numeric column, including count, mean, standard deviation, min, max, and quantiles.
Example:
df.describe()
For categorical columns, you can use df['column'].value_counts() to see the frequency distribution of each category.
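For instance, assuming a categorical 'City' column like the one used in earlier examples:
df['City'].value_counts()                 # absolute counts per category
df['City'].value_counts(normalize=True)   # relative frequencies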
In Pandas, selecting a single column from a DataFrame can be done using the column name either as a string key or as an attribute. Here are two common ways to select a single column:
Using the column name as a key:
df['column_name']
Using attribute-style access (this method works only if the column name is a valid Python identifier and does not conflict with DataFrame methods):
df.column_name
Both methods return a Pandas Series object, which is a 1D array of values from the selected column.
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
age_column = df['Age']
To select multiple columns from a DataFrame, you need to pass a list of column names inside square brackets. This returns a DataFrame consisting of only the selected columns.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
})
selected_columns = df[['Name', 'Age']]
In this example, selected_columns will be a DataFrame with the Name and Age columns.
You can filter rows in a DataFrame based on a condition by using boolean indexing. This involves applying a condition to one or more columns, which returns a boolean Series that you can use to filter the rows.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
In this example, filtered_df contains only Charlie's row, since he is the only person with an age greater than 30.
You can also filter using multiple conditions:
filtered_df = df[(df['Age'] > 25) & (df['Name'] == 'Bob')]
Here, the & operator is used to combine multiple conditions, and you must wrap each condition in parentheses.
The shape of a DataFrame can be obtained using the .shape attribute, which returns a tuple representing the number of rows and columns in the DataFrame.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
shape = df.shape
print(shape) # Output: (3, 2) -- 3 rows and 2 columns
Here, df.shape returns a tuple (3, 2), where 3 is the number of rows and 2 is the number of columns.
The info() method in Pandas provides a concise summary of a DataFrame. It reports the index type and range, the column names, the count of non-null values per column, each column's dtype, and the approximate memory usage.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
})
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 3 non-null object
1 Age 3 non-null int64
2 City 3 non-null object
dtypes: int64(1), object(2)
memory usage: 111.0+ bytes
This is especially useful for quickly understanding the structure of your data, identifying missing values, and checking column types.
Handling missing data is a common task when working with real-world data. Pandas provides several ways to handle missing values, which are represented as NaN (Not a Number).
Identifying missing data: You can use the .isna() or .isnull() methods to identify missing values. Both return a DataFrame of the same shape, with True for missing values and False for non-missing values. Example:
df.isna()
Dropping missing values: You can remove rows or columns with missing values using the .dropna() method. Example:
df.dropna() # Drops any rows with NaN values
df.dropna(axis=1) # Drops any columns with NaN values
Filling missing values: The .fillna() method allows you to fill missing values with a specific value, such as the mean or median, or with forward or backward filling. Example:
df.fillna(0) # Replace NaN with 0
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Replace NaN in 'Age' column with the mean
NaN stands for Not a Number, and it represents missing or undefined data in a Pandas DataFrame. It is the standard missing value marker in Pandas and is part of the IEEE floating point standard.
NaN typically appears in DataFrames when values are missing in the source data, when a merge, join, or reindex operation produces unmatched labels, or when a computation has no defined result.
You can check for NaN values using the .isna() or .isnull() method.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, np.nan, 35]
})
df.isna() # Check for NaN values
Output:
Name Age
0 False False
1 False True
2 False False
Pandas provides several strategies for filling missing data using the .fillna() method. You can fill missing values with a constant value, or with calculated values like the mean, median, or mode of a column. You can also use forward fill (ffill) or backward fill (bfill), where missing values are replaced with the previous or next valid value, respectively.
Filling with a constant:
df.fillna(0) # Replace all NaN values with 0
Filling with the mean of a column:
df['Age'] = df['Age'].fillna(df['Age'].mean())
Forward fill (propagate previous valid value):
df.fillna(method='ffill')
Backward fill (propagate next valid value):
df.fillna(method='bfill')
(In recent Pandas versions, the dedicated df.ffill() and df.bfill() methods are preferred over the deprecated method= argument.)
To drop missing values in a Pandas DataFrame, you can use the .dropna() method. This will remove any rows or columns that contain NaN values.
Drop rows with missing values:
df.dropna() # Drops any rows with NaN values
Drop columns with missing values:
df.dropna(axis=1) # Drops any columns with NaN values
You can also set a threshold: for example, keep only rows that have at least 2 non-null values:
df.dropna(thresh=2) # Keep rows with at least 2 non-null values
The dropna() method is used to remove missing values from a DataFrame by deleting either rows or columns that contain NaN values. This is useful when you want to clean your data by removing incomplete or missing entries.
You can specify whether to drop rows or columns, and also define a threshold for the number of non-null values required to keep a row or column.
Key parameters of dropna(): axis (0 to drop rows, 1 to drop columns), how ('any' drops if any value is NaN, 'all' drops only if all values are NaN), thresh (minimum number of non-null values required to keep a row or column), subset (restrict the check to specific columns), and inplace.
Example:
df.dropna(axis=0, how='any') # Drop rows with any NaN values
df.dropna(axis=1, how='all') # Drop columns with all NaN values
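A short sketch of the subset parameter (assuming an 'Age' column exists):
df.dropna(subset=['Age'])  # Drop only rows where 'Age' is NaN; NaNs in other columns are kept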
To sort a DataFrame in Pandas, you can use the .sort_values() method, which allows you to specify one or more columns by which to sort the data.
Sorting by a single column:
df.sort_values(by='column_name')
By default, .sort_values() sorts in ascending order. To sort in descending order, you can set the ascending parameter to False:
df.sort_values(by='column_name', ascending=False)
Sorting by multiple columns:
To sort by multiple columns, pass a list of column names to the by parameter:
df.sort_values(by=['column1', 'column2'])
You can also specify the sorting order for each column individually by passing a list to the ascending parameter:
df.sort_values(by=['column1', 'column2'], ascending=[True, False])
Example
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 25],
'Score': [85, 90, 88, 95]
})
df.sort_values(by=['Age', 'Score'], ascending=[True, False])
This will first sort by Age in ascending order and then by Score in descending order.
The iloc[] function in Pandas is used for integer-location based indexing. It allows you to select rows and columns from a DataFrame using integer indices (i.e., numerical positions).
Syntax:
df.iloc[row_index, column_index]
Examples:
Select the first row and first column:
df.iloc[0, 0]
Select the first 3 rows and all columns:
df.iloc[:3, :]
Select all rows and the first two columns:
df.iloc[:, :2]
Select a specific set of rows and columns (e.g., rows 1 and 2, columns 0 and 2):
df.iloc[[1, 2], [0, 2]]
The loc[] function in Pandas is used for label-based indexing. It allows you to select rows and columns by their label names (i.e., row and column names).
Syntax:
df.loc[row_label, column_label]
Examples:
Select a specific row by label:
df.loc[0] # Select the row with index label 0
Select a specific value in a row and column:
df.loc[0, 'Age'] # Select the 'Age' value of the row with index 0
Select multiple rows by label:
df.loc[0:2] # Select rows with labels 0, 1, and 2
Select multiple rows and columns by label:
df.loc[0:2, ['Name', 'Score']] # Select rows 0 to 2 and columns 'Name' and 'Score'
To rename columns in a DataFrame, use the rename() method. This method allows you to specify new column names by passing a dictionary of old column names to new column names.
df.rename(columns={'old_name': 'new_name'})
Example
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
df.rename(columns={'Name': 'Full Name', 'Age': 'Age Group'}, inplace=True)
The inplace=True argument modifies the original DataFrame; otherwise, it returns a new DataFrame with the updated column names.
To reset the index of a DataFrame, you can use the reset_index() method. This will move the current index to a new column and replace it with the default integer-based index.
Syntax:
df.reset_index(drop=False)
Example:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
df = df.set_index('Name') # Setting 'Name' as the index
df.reset_index(drop=True, inplace=True) # Reset index and drop old index column
The groupby() function in Pandas is used to group a DataFrame based on one or more columns and then perform operations like aggregation, transformation, or filtering on each group.
Steps involved in groupby() (the split-apply-combine pattern): split the data into groups based on one or more keys, apply a function (aggregation, transformation, or filtering) to each group, and combine the results into a new object.
Common operations: sum(), mean(), count(), min(), max(), agg(), transform(), and filter().
Example:
df = pd.DataFrame({
'Team': ['A', 'B', 'A', 'B'],
'Points': [10, 20, 15, 25]
})
grouped = df.groupby('Team').sum() # Group by 'Team' and sum 'Points'
This will group the rows by the 'Team' column and sum the 'Points' for each group.
You can filter columns in a DataFrame by applying a condition to the column values. This returns a new DataFrame with only the columns that satisfy the condition.
Example:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Score': [85, 90, 88]
})
filtered_df = df[df['Age'] > 30] # Filter rows where Age is greater than 30
You can also filter columns (not rows) based on conditions; the condition must be computed over numeric columns:
numeric = df.select_dtypes('number')
filtered_columns = numeric.loc[:, numeric.mean() > 80]  # Keep numeric columns whose mean is > 80
To combine two DataFrames vertically (i.e., stacking them row-wise), you can use the concat() function, or the older append() method (deprecated and removed in Pandas 2.0, so concat() is preferred in current code).
Example using append():
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
})
df2 = pd.DataFrame({
'Name': ['Charlie', 'David'],
'Age': [35, 40]
})
combined = df1.append(df2, ignore_index=True)
Example using concat():
combined = pd.concat([df1, df2], ignore_index=True)
The ignore_index=True argument ensures the index is reset after appending.
To merge two DataFrames based on a common column, you can use the merge() function, which is similar to SQL join operations (e.g., inner join, left join).
Syntax:
df1.merge(df2, on='common_column', how='inner')
Example:
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [2, 3, 4],
'Age': [30, 35, 40]
})
merged_df = df1.merge(df2, on='ID', how='inner')
This will merge the DataFrames based on the ID column, keeping only rows with matching IDs in both DataFrames (inner join).
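Using the same df1 and df2, a left join keeps every row of df1 and fills unmatched rows with NaN:
merged_left = df1.merge(df2, on='ID', how='left')  # ID 1 (Alice) has no match in df2, so its Age becomes NaN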
The main difference between merge() and concat() in Pandas lies in how they combine DataFrames: merge() performs SQL-style joins on common columns or indexes (inner, left, right, outer), while concat() simply stacks DataFrames along an axis (rows or columns) without matching on keys.
Example using concat():
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df_concat = pd.concat([df1, df2], axis=1)
In summary: use merge() when you need key-based joins, and concat() when you just need to stack DataFrames along rows or columns.
To check for duplicate rows in a DataFrame, you can use the duplicated() method. This method returns a boolean Series, where True indicates a duplicate row and False indicates a unique row. By default, it checks for duplicates in all columns.
Syntax:
df.duplicated()
You can also check for duplicates in specific columns by passing the column names to the subset parameter.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
})
duplicates = df.duplicated()
print(duplicates)
Output
0 False
1 False
2 True
3 False
dtype: bool
In this case, the row with index 2 is a duplicate of the row with index 0.
To remove duplicate rows, you can use the drop_duplicates() method. By default, this method removes all rows that are duplicates, keeping the first occurrence.
Syntax:
df.drop_duplicates()
You can specify the columns on which to check for duplicates using the subset parameter and whether to keep the first, last, or none using the keep parameter.
Example:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
})
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
You can also keep the last occurrence of each duplicate:
df_no_duplicates_last = df.drop_duplicates(keep='last')
Or remove all duplicates:
df_no_duplicates_none = df.drop_duplicates(keep=False)
You can apply a function to a DataFrame column using the apply() method. This allows you to apply a function to each element of the column.
Syntax:
df['column_name'].apply(function)
Example
import pandas as pd
df = pd.DataFrame({
'Age': [25, 30, 35]
})
# Apply a function to square each element in the 'Age' column
df['Age_squared'] = df['Age'].apply(lambda x: x ** 2)
print(df)
Output
Age Age_squared
0 25 625
1 30 900
2 35 1225
The apply() method is very flexible and can be used to apply any custom function to the column.
The apply() function in Pandas is used to apply a function along an axis (either rows or columns) of a DataFrame or Series. It is a powerful tool for performing element-wise operations or aggregations, especially when the operation is complex and cannot be done directly with vectorized operations.
You can apply a function to: each column (axis=0, the default), each row (axis=1), or each element of a Series (via Series.apply() or Series.map()).
Example (applying to rows):
df.apply(lambda row: row['Age'] * 2, axis=1)
Example (applying to columns):
df.apply(lambda col: col.max() - col.min())
In the first example, axis=1 indicates that the function is applied row-wise, while in the second example, axis=0 (default) applies the function column-wise.
You can change the datatype of a DataFrame column using the astype() method. This method allows you to specify the desired datatype for one or more columns.
Syntax:
df['column_name'] = df['column_name'].astype(new_type)
Example:
df = pd.DataFrame({
'Age': ['25', '30', '35']
})
# Change the 'Age' column from string to integer
df['Age'] = df['Age'].astype(int)
print(df)
Output
Age
0 25
1 30
2 35
You can also convert multiple columns at once by passing a dictionary to astype():
Example
df = df.astype({'Age': 'int64'})
The astype() method in Pandas is used to cast a column (or an entire DataFrame) to a specified data type. This is particularly useful when the data types of columns are not as expected (e.g., numeric values stored as strings).
Example:
df['Age'] = df['Age'].astype(float)
You can use astype() to convert columns to other numeric types (e.g., int, float) or to categorical types ('category'), datetime ('datetime64'), etc.
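For example, a hedged sketch assuming hypothetical 'City' and 'Date' columns:
df['City'] = df['City'].astype('category')   # store repeated strings as a categorical type
df['Date'] = pd.to_datetime(df['Date'])      # for dates, pd.to_datetime() is usually preferred over astype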
Pandas allows you to perform arithmetic operations on DataFrame columns directly. You can add, subtract, multiply, or divide columns in a straightforward manner.
Example:
df = pd.DataFrame({
'Age': [25, 30, 35],
'Score': [85, 90, 88]
})
# Add two columns
df['Total'] = df['Age'] + df['Score']
# Subtract two columns
df['Age_minus_Score'] = df['Age'] - df['Score']
# Multiply two columns
df['Age_times_Score'] = df['Age'] * df['Score']
# Divide two columns
df['Age_divided_by_Score'] = df['Age'] / df['Score']
These operations can also be performed with constants:
df['Age_plus_5'] = df['Age'] + 5
Pandas handles element-wise operations, so the operation is applied to each value in the columns.
To check for null values (NaN) in a DataFrame, you can use the isna() or isnull() methods. These return a DataFrame of the same shape with True for each NaN value and False otherwise.
Example:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, None, 35]
})
df.isna()
This will return:
Name Age
0 False False
1 False True
2 False False
To check if there are any null values in the entire DataFrame, use:
df.isna().any().any() # Returns True if any NaN values are present
To count the number of null values per column:
df.isna().sum()
The at[] and iat[] methods are used to access individual elements in a DataFrame, but they differ in how they work:
at[]: It is used for label-based indexing to access a single value from a DataFrame. You provide the row label and column label.
df.at[1, 'Age'] # Access the value in row 1 and column 'Age'
iat[]: It is used for integer-location based indexing. You provide the row index and column index as integers.
df.iat[1, 1] # Access the value at row index 1 and column index 1
Both methods are faster than using loc[] or iloc[] when accessing a single value.
The pd.to_datetime() function is used to convert a column or series of date-like objects (e.g., strings) into Pandas datetime objects. This is useful when dealing with date data stored as strings or other formats, as it ensures that the data is properly interpreted as datetime objects, allowing you to perform datetime-related operations.
Syntax:
pd.to_datetime(data)
Example
df = pd.DataFrame({
'Date': ['2023-01-01', '2023-02-01', '2023-03-01']
})
df['Date'] = pd.to_datetime(df['Date'])
After conversion, the Date column will have the datetime64 type, enabling datetime operations like comparison, extraction of day/month/year, and more.
You can also specify the format of the date string for better performance:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
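If some strings may not be valid dates, the errors parameter controls what happens (a small sketch):
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # invalid date strings become NaT instead of raising an error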
You can filter rows of a DataFrame based on multiple conditions by combining conditions with logical operators (e.g., & for AND, | for OR). Each condition must be enclosed in parentheses to ensure correct precedence.
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Score': [85, 90, 88, 95]
})
# Filter rows where Age is greater than 30 and Score is greater than 90
filtered_df = df[(df['Age'] > 30) & (df['Score'] > 90)]
print(filtered_df)
This will output:
Name Age Score
3 David 40 95
In the example above, both conditions are evaluated element-wise and combined with &, so only the row satisfying both (David, age 40, score 95) is kept.
Note: Use & for AND conditions and | for OR conditions. Always wrap each condition in parentheses to avoid syntax errors.
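Building on the same DataFrame, a quick sketch of OR conditions and membership tests:
# Rows where Age is over 35 OR Score is over 90
filtered_or = df[(df['Age'] > 35) | (df['Score'] > 90)]
# Rows where Name is one of several values
filtered_isin = df[df['Name'].isin(['Alice', 'David'])]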
Both merge() and join() are used to combine two DataFrames, but they have different use cases and syntax:
merge(): combines DataFrames based on one or more common columns (or indexes), with SQL-style join semantics. Syntax:
df1.merge(df2, on='key', how='inner')
Example
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})
merged_df = df1.merge(df2, on='ID', how='inner')
join(): combines DataFrames based on their index (or a key column of the calling DataFrame), defaulting to a left join. Syntax:
df1.join(df2, on='key', how='left')
Example
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'Age': [25, 30, 35]}, index=[1, 2, 3])
joined_df = df1.join(df2)
Key Differences: merge() requires you to name the join key(s) explicitly via on, left_on, or right_on and defaults to an inner join, while join() joins on the index by default and defaults to a left join; join() is essentially a convenience wrapper around merge() for index-based joins.
You can change the index of a DataFrame using the set_index() method. This method allows you to set one or more columns as the new index of the DataFrame.
Syntax:
df.set_index('column_name', inplace=True)
Example
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
# Set 'Name' as the index
df.set_index('Name', inplace=True)
print(df)
Output
Age
Name
Alice 25
Bob 30
Charlie 35
You can also reset the index to the default integer-based index using reset_index():
df.reset_index(inplace=True)
The pivot_table() function in Pandas is used to create a pivot table from a DataFrame. It allows you to aggregate data and reshape the data, summarizing it based on one or more grouping columns.
Syntax:
df.pivot_table(values='column_name', index='group_column', aggfunc='mean')
Example:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
'Age': [25, 30, 35, 40, 25],
'Score': [85, 90, 88, 95, 80]
})
pivot = df.pivot_table(values='Score', index='Name', aggfunc='mean')
print(pivot)
Output
Score
Name
Alice 82.5
Bob 90.0
Charlie 88.0
David 95.0
In this example, the pivot_table() calculates the mean score for each name, grouping the data by the 'Name' column.
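The same data can also be summarized with several aggregation functions at once, as in this small sketch using the df defined above:
pivot_multi = df.pivot_table(values='Score', index='Name', aggfunc=['mean', 'max'])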
You can use the groupby() function followed by an aggregation function like sum(), mean(), or count() to group data and compute aggregate statistics.
Syntax:
df.groupby('group_column').agg(aggregation_function)
Example
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 25, 30, 35],
'Score': [85, 90, 88, 92, 88]
})
grouped = df.groupby('Name').agg({'Age': 'mean', 'Score': 'mean'})
print(grouped)
Output
Age Score
Name
Alice 25.0 86.5
Bob 30.0 91.0
Charlie 35.0 88.0
In this example, groupby('Name') groups the data by the 'Name' column, and agg({'Age': 'mean', 'Score': 'mean'}) computes the mean of 'Age' and 'Score' for each group.
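You can also compute several statistics per column. For instance, a sketch using named aggregation (the result column names here are arbitrary):
grouped = df.groupby('Name').agg(
    avg_score=('Score', 'mean'),
    max_score=('Score', 'max'),
    n_rows=('Score', 'count')
)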
A MultiIndex in Pandas allows you to work with multiple levels of indexing in a DataFrame. This is useful for representing hierarchical data, where each index can have multiple levels.
Creating a MultiIndex:
You can create a MultiIndex by passing a list of tuples or using the set_index() method with multiple columns.
Example:
df = pd.DataFrame({
'Region': ['North', 'North', 'South', 'South'],
'City': ['New York', 'Boston', 'Chicago', 'Miami'],
'Sales': [100, 200, 300, 400]
})
df.set_index(['Region', 'City'], inplace=True)
print(df)
Output
Sales
Region City
North New York 100
Boston 200
South Chicago 300
Miami 400
In this example, 'Region' and 'City' form a MultiIndex.
Accessing data with MultiIndex:
You can access data in a MultiIndex using .loc[]:
df.loc[('North', 'New York')]
You can also reset the MultiIndex using reset_index():
df.reset_index(inplace=True)
You can concatenate DataFrames along rows or columns using the concat() function.
Concatenating along rows (vertically):
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)
Output
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 David 40
Concatenating along columns (horizontally):
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Score': [85, 90]})
result = pd.concat([df1, df2], axis=1)
print(result)
Output
Name Age Score
0 Alice 25 85
1 Bob 30 90
concat(): It is used for concatenating multiple DataFrames along a particular axis (either rows or columns). It is more flexible and can concatenate more than two DataFrames at once. You can also control how the index is handled with the ignore_index parameter. Example:
pd.concat([df1, df2], axis=0)
append(): It is used to append one DataFrame to the end of another, row-wise (axis=0). append() is essentially a shorthand for concat(), but it is less flexible and can only append one DataFrame at a time. Example:
df1.append(df2, ignore_index=True)
Key Difference: concat() is more versatile and can concatenate multiple DataFrames, while append() is simpler and appends one DataFrame at a time. Note that append() was deprecated and removed in Pandas 2.0, so concat() should be used in current code.
You can handle categorical data in Pandas by using the Categorical data type. This data type is useful when you have a column with a limited number of possible values (categories), as it is more memory efficient and allows for more optimized operations.
Example:
df = pd.DataFrame({
'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']
})
# Convert 'Color' column to categorical type
df['Color'] = pd.Categorical(df['Color'])
print(df)
Pandas also supports CategoricalDtype, and you can use category to manage categorical data with methods like cat.codes to map categories to numerical codes.
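For instance, a short sketch of the cat accessor applied to the 'Color' column above:
print(df['Color'].cat.categories)          # Index(['Blue', 'Green', 'Red'], dtype='object')
df['Color_code'] = df['Color'].cat.codes   # integer code for each category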
The cut() function is used to segment and sort data values into discrete bins or intervals. This is useful for converting continuous data into categorical data by grouping values into bins.
Syntax
pd.cut(data, bins)
Example
df = pd.DataFrame({'Age': [22, 25, 30, 35, 40, 45]})
# Define bins for age groups
bins = [20, 30, 40, 50]
# Cut the data into bins
df['Age_Group'] = pd.cut(df['Age'], bins)
print(df)
Output
Age Age_Group
0 22 (20, 30]
1 25 (20, 30]
2 30 (20, 30]
3 35 (30, 40]
4 40 (30, 40]
5 45 (40, 50]
The cut() function divides the age data into three bins: (20, 30], (30, 40], and (40, 50]. You can also customize the bin labels and whether to include right or left edges in the intervals.
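For example, a sketch that attaches readable labels to the same bins (the label names are arbitrary):
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50], labels=['20s', '30s', '40s'])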
You can sample random rows from a DataFrame using the sample() method. The sample() method allows you to randomly select a specified number of rows (or a fraction of the data) from the DataFrame.
Syntax:
df.sample(n=number_of_rows)
Where n is the number of rows you want to sample.
You can also specify the fraction of rows to sample using the frac parameter.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45]
})
# Sample 2 random rows from the DataFrame
sampled_df = df.sample(n=2)
print(sampled_df)
To sample a fraction of the rows (e.g., 50%)
sampled_df = df.sample(frac=0.5)
If you want reproducible results (i.e., the same random rows on every run), you can set the random seed:
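sampled_df = df.sample(n=2, random_state=42)  # random_state can be any fixed integer; the same 2 rows are returned on every run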
Pandas provides the str accessor to perform vectorized string operations on a column of text data. The str methods are very similar to Python's built-in string methods but are optimized for working with Series of strings.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
# Convert all names to uppercase
df['Name_upper'] = df['Name'].str.upper()
# Check if a name contains the letter 'a'
df['Contains_a'] = df['Name'].str.contains('a')
print(df)
Output
Name Name_upper Contains_a
0 Alice ALICE False
1 Bob BOB False
2 Charlie CHARLIE True
3 David DAVID True
Some other commonly used str methods include: str.lower(), str.strip(), str.replace(), str.split(), str.len(), str.startswith(), and str.endswith().
The str.contains() method is used to check if a substring or regular expression pattern exists within each string of a Series. It returns a boolean Series indicating whether the substring is present in each string.
Syntax:
df['column_name'].str.contains('substring', na=False)
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
# Check if the name contains the letter 'a'
df['Has_a'] = df['Name'].str.contains('a')
print(df)
Output
Name Has_a
0 Alice False
1 Bob False
2 Charlie True
3 David True
In this example, str.contains('a') checks each name for the lowercase letter 'a'; the match is case-sensitive by default, which is why 'Alice' returns False.
Note: You can also use regular expressions in the str.contains() method for more advanced matching.
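For example, a small sketch with a regular expression (the pattern is purely illustrative):
df['Starts_with_A'] = df['Name'].str.contains(r'^A', regex=True)  # True only for names beginning with a capital 'A'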
You can filter rows based on the presence of a substring by using str.contains() along with boolean indexing. This allows you to create a condition where you only select rows that contain the substring in the specified column.
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]
})
# Filter rows where the 'Name' contains the substring 'a'
filtered_df = df[df['Name'].str.contains('a', case=False)]
print(filtered_df)
Output
Name Age
0 Alice 25
2 Charlie 35
3 David 40
In this example, the str.contains('a', case=False) method filters rows where the 'Name' column contains the letter 'a' (case-insensitive).
You can also apply regular expressions or use the na parameter for missing values:
df[df['Name'].str.contains('a', na=False)]
Normalization or scaling is a technique to adjust the values in a numeric column so they fall within a specific range (commonly [0, 1]). You can normalize data using MinMaxScaler from sklearn.preprocessing, or manually using pandas operations.
Using MinMaxScaler from sklearn:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
Manual normalization:
To scale a column manually to the range [0, 1], you can use the following formula:
normalized = (x - min(x)) / (max(x) - min(x))
Example:
import pandas as pd
df = pd.DataFrame({
'Age': [25, 30, 35, 40, 45]
})
# Normalize the 'Age' column
df['Age_normalized'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
print(df)
Output
Age Age_normalized
0 25 0.0
1 30 0.25
2 35 0.5
3 40 0.75
4 45 1.0
pivot(): The pivot() function is used to reshape data by turning unique values from one column into separate columns. It is typically used when you want to "spread" a column's values into multiple columns. Example of pivot():
df = pd.DataFrame({
'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
'City': ['New York', 'New York', 'Boston', 'Boston'],
'Temperature': [30, 32, 25, 28]
})
pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)
Output
City Boston New York
Date
2023-01-01 25 30
2023-01-02 28 32
melt(): The melt() function is the inverse of pivot(). It "unpacks" a DataFrame and converts it from a wide format to a long format by gathering columns into a single column. Example of melt():
df_melted = pivot_df.reset_index().melt(id_vars='Date', value_vars=['Boston', 'New York'], var_name='City', value_name='Temperature')
print(df_melted)
Output
Date City Temperature
0 2023-01-01 Boston 25
1 2023-01-02 Boston 28
2 2023-01-01 New York 30
3 2023-01-02 New York 32
Pandas provides various tools for working with time-series data. The to_datetime() function allows you to convert strings or other data types into datetime objects, enabling time-series functionality such as resampling, shifting, and rolling window operations.
Example of time-series analysis:
import pandas as pd
# Create a DataFrame with a date range
df = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
'Sales': [100, 150, 200, 250, 300]
})
df.set_index('Date', inplace=True)
print(df)
Output
Sales
Date
2023-01-01 100
2023-01-02 150
2023-01-03 200
2023-01-04 250
2023-01-05 300
You can now perform time-based indexing, resampling, or apply rolling operations.
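For instance, a brief sketch of a rolling window on the Sales column defined above (the window size is arbitrary):
df['Sales_3d_avg'] = df['Sales'].rolling(window=3).mean()  # 3-day moving average; the first two rows are NaN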
Resampling allows you to change the frequency of time-series data. You can use resample() to upsample or downsample time-series data based on a specified time frequency (e.g., daily, monthly).
Example:
df = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
'Sales': [100, 150, 200, 250, 300]
})
df.set_index('Date', inplace=True)
# Resample data to monthly frequency, using the sum of sales
monthly_sales = df.resample('M').sum()
print(monthly_sales)
Output:
Sales
Date
2023-01-31 1000
You can convert a column to datetime format using pd.to_datetime(). This function automatically detects the format of the date strings and converts them into datetime objects.
Example:
df = pd.DataFrame({
'Date': ['2023-01-01', '2023-02-01', '2023-03-01']
})
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Output
Date
0 2023-01-01
1 2023-02-01
2 2023-03-01
Pandas provides the fillna() method to fill missing values. The method='ffill' (forward fill) and method='bfill' (backward fill) parameters allow you to propagate the last valid value forward or the next valid value backward.
Example:
df = pd.DataFrame({
'Value': [10, None, 20, None, 30]
})
# Forward fill missing values
df['Value_ffill'] = df['Value'].fillna(method='ffill')
# Backward fill missing values
df['Value_bfill'] = df['Value'].fillna(method='bfill')
print(df)
Output
Value Value_ffill Value_bfill
0 10.0 10.0 10.0
1 NaN 10.0 20.0
2 20.0 20.0 20.0
3 NaN 20.0 30.0
4 30.0 30.0 30.0
To drop rows where a column contains a specific value, you can use boolean indexing combined with the drop() method or loc[].
Example:
Suppose you have a DataFrame where you want to remove rows where the 'Age' column has a value of 30.
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 30]
})
# Drop rows where 'Age' is 30
df = df[df['Age'] != 30]
print(df)
Output
Name Age
0 Alice 25
2 Charlie 35
Alternatively, you can use the drop() method with a condition:
df = df.drop(df[df['Age'] == 30].index)