Pandas Interview Questions and Answers

Find 100+ Pandas interview questions and answers to assess candidates' skills in data manipulation, DataFrames, series, aggregation, and data analysis with Python.
By WeCP Team

As Python remains the language of choice for data analysis and data science, Pandas has become indispensable for handling, manipulating, and analyzing structured data efficiently. Recruiters must identify candidates skilled in data wrangling, aggregation, and transformation using Pandas to ensure accurate, high-quality insights.

This resource, "100+ Pandas Interview Questions and Answers," is tailored for recruiters to simplify the evaluation process. It covers topics from basic data manipulation to advanced data analysis and performance optimization.

Whether hiring for Data Analysts, Data Scientists, or Machine Learning Engineers, this guide enables you to assess a candidate’s:

Core Pandas Knowledge

  • Data Structures: Series and DataFrame, their attributes, methods, and indexing.
  • Data Selection & Filtering: .loc, .iloc, boolean indexing, and conditional selection.
  • Data Cleaning: Handling missing values, duplicates, and type conversions.
  • Basic Operations: Sorting, merging, concatenation, and grouping.

Advanced Skills

  • Aggregation & GroupBy Operations: Multi-level grouping, applying custom functions, and pivot tables.
  • Time Series Analysis: Resampling, rolling windows, and datetime indexing.
  • Performance Optimization: Vectorized operations, memory-efficient data types, and avoiding loops.
  • Data Visualization Integration: Using Pandas with Matplotlib or Seaborn for quick insights.

Real-World Proficiency

  • Transforming raw datasets into clean, analysis-ready formats.
  • Performing complex multi-step aggregations for reporting or analytics.
  • Debugging and profiling data pipelines for efficiency.
  • Integrating Pandas workflows into machine learning or BI pipelines.

For a streamlined assessment process, consider platforms like WeCP, which allow you to:

  • Create customized Pandas assessments tailored to your data analysis and reporting needs.
  • Include hands-on tasks, such as cleaning messy datasets, performing complex groupby operations, or generating pivot tables and insights.
  • Proctor assessments remotely with AI-based monitoring to ensure integrity.
  • Leverage automated grading to evaluate correctness, efficiency, and adherence to best practices in data manipulation.

Save time, ensure technical depth, and confidently hire Pandas professionals who can analyze and transform data accurately and efficiently from day one.

Pandas Interview Questions

Beginner Level Questions

  1. What is Pandas and how does it differ from NumPy?
  2. What is a DataFrame in Pandas?
  3. What are the key data structures in Pandas?
  4. What is the difference between a DataFrame and a Series in Pandas?
  5. How do you create a Pandas DataFrame?
  6. How do you create a Series in Pandas?
  7. What is the purpose of pd.read_csv() function?
  8. How can you read an Excel file in Pandas?
  9. How do you check the first few rows of a DataFrame?
  10. What is the function used to get basic statistics of a DataFrame?
  11. How do you select a single column from a DataFrame?
  12. How can you select multiple columns from a DataFrame?
  13. How can you filter rows in a DataFrame based on a condition?
  14. How do you get the shape of a DataFrame?
  15. What is the use of info() method in Pandas?
  16. How do you handle missing data in Pandas?
  17. What does NaN represent in a DataFrame?
  18. How do you fill missing values in a DataFrame?
  19. How can you drop missing values in a DataFrame?
  20. What is the purpose of dropna() method in Pandas?
  21. How do you sort a DataFrame by one or more columns?
  22. What does the iloc[] function do in Pandas?
  23. What does the loc[] function do in Pandas?
  24. How can you rename columns in a DataFrame?
  25. How do you reset the index of a DataFrame?
  26. What is the purpose of groupby() function in Pandas?
  27. How do you filter DataFrame columns based on a condition?
  28. How do you combine two DataFrames vertically (append)?
  29. How can you merge two DataFrames based on a common column?
  30. What is the difference between merge() and concat() functions in Pandas?
  31. How can you check for duplicate values in a DataFrame?
  32. How do you remove duplicate rows in a DataFrame?
  33. How do you apply a function to a DataFrame column?
  34. What is the purpose of apply() function in Pandas?
  35. How do you change the datatype of a DataFrame column?
  36. What is the purpose of astype() method in Pandas?
  37. How do you perform basic arithmetic operations on DataFrame columns?
  38. How do you check if a DataFrame contains null values?
  39. What is the difference between at[] and iat[] in Pandas?
  40. What is the use of pd.to_datetime() function?

Intermediate Level Questions

  1. How can you filter rows of a DataFrame based on multiple conditions?
  2. Explain the difference between merge() and join() in Pandas.
  3. How do you change the index of a DataFrame?
  4. What is the purpose of pivot_table() in Pandas?
  5. How do you perform a groupby operation and then aggregate the data?
  6. What is a MultiIndex in Pandas? How do you work with it?
  7. How do you concatenate DataFrames along columns and rows?
  8. Explain the difference between concat() and append() in Pandas.
  9. How do you handle categorical data in Pandas?
  10. What is the cut() function used for in Pandas?
  11. How can you sample random rows from a DataFrame?
  12. How do you perform string operations in Pandas?
  13. What is the purpose of str.contains() method in Pandas?
  14. How do you filter rows based on the presence of a substring in a column?
  15. How do you normalize a DataFrame (e.g., scale numeric data)?
  16. What is the difference between pivot() and melt() functions in Pandas?
  17. How can you perform time-series analysis in Pandas?
  18. How do you resample time-series data?
  19. How do you convert a DataFrame column to datetime format?
  20. How do you fill missing values using forward fill or backward fill in Pandas?
  21. How do you drop rows with a specific value in a column?
  22. How do you handle outliers in a Pandas DataFrame?
  23. What is the purpose of rolling() method in Pandas?
  24. How do you compute the moving average of a DataFrame?
  25. How do you work with JSON data in Pandas?
  26. How can you apply functions to groups within a groupby operation?
  27. How do you perform a SQL-like operation (such as join, left join) using Pandas?
  28. How do you compute the cumulative sum or cumulative product in Pandas?
  29. What is the purpose of transform() function in Pandas?
  30. What is the difference between agg() and apply() in a groupby operation?
  31. How do you get the cumulative sum for each group in a groupby operation?
  32. How do you set a specific column as the index in a DataFrame?
  33. How do you convert a DataFrame index to a column?
  34. How do you handle duplicate indices in Pandas?
  35. How do you convert categorical columns into dummy/indicator variables?
  36. What is the difference between iterrows() and itertuples() in Pandas?
  37. How do you deal with large datasets in Pandas efficiently?
  38. How can you optimize memory usage in a large DataFrame?
  39. How do you deal with timezones in Pandas time-series data?
  40. How do you sort a DataFrame by its index?

Experienced Level Questions

  1. How do you optimize performance when working with large datasets in Pandas?
  2. What are the various methods for joining multiple DataFrames in Pandas?
  3. How can you handle duplicate data efficiently in Pandas?
  4. How would you deal with missing values in time-series data?
  5. Explain the concept of vectorization and why it is important in Pandas.
  6. How would you handle out-of-memory issues when working with large datasets?
  7. How do you efficiently compute group-wise operations in Pandas (e.g., sum, mean)?
  8. How do you handle irregular time-series data (e.g., missing timestamps)?
  9. How can you perform custom aggregation functions in a groupby() operation?
  10. How can you perform parallel computing with Pandas?
  11. How do you convert a DataFrame into a sparse DataFrame?
  12. What is the role of the applymap() function in Pandas?
  13. How do you create and use a custom Pandas extension type?
  14. Explain the internal architecture of Pandas DataFrames.
  15. What are some ways to optimize memory usage when creating large DataFrames?
  16. How do you benchmark the performance of Pandas operations?
  17. How can you efficiently merge large datasets in Pandas?
  18. Explain how to perform window-based computations (rolling, expanding) in Pandas.
  19. What are the differences between query() and loc[] in Pandas for filtering?
  20. How do you implement time-based indexing and resampling for financial data?
  21. How do you handle data imputation in a highly imbalanced dataset?
  22. What is the difference between merge() and concat() when performing SQL-like joins?
  23. How do you handle multi-level columns or multi-indexes for hierarchical data?
  24. How can you efficiently perform data wrangling on a large unstructured dataset?
  25. How do you perform outlier detection and removal in large datasets?
  26. How would you profile a large dataset to identify performance bottlenecks in Pandas?
  27. Explain the role of the Categorical data type in Pandas for memory optimization.
  28. What is the difference between apply() and map() functions in Pandas?
  29. How can you scale a Pandas-based solution to work with distributed computing (e.g., Dask)?
  30. How do you perform a left-outer, right-outer, inner, or full join in Pandas?
  31. How do you work with non-tabular data (like JSON, XML) in Pandas?
  32. How do you handle time-zone-aware time-series data in Pandas?
  33. Explain the difference between astype() and convert_dtypes() in Pandas.
  34. How do you aggregate and visualize data at scale with Pandas?
  35. What is a performance trade-off between using apply() and vectorized operations?
  36. How do you optimize memory when working with categorical features in a dataset?
  37. How would you perform feature engineering using Pandas for machine learning?
  38. Explain how block-wise operations can improve performance in Pandas.
  39. How do you parallelize and distribute Pandas tasks for large datasets using frameworks like Dask or Ray?
  40. How would you manage large-scale datasets in memory using Pandas' HDF5 storage format?

Pandas Interview Questions and Answers

Beginner Questions with Answers

1. What is Pandas and how does it differ from NumPy?

Pandas is an open-source data manipulation and analysis library built on top of NumPy. It provides powerful, flexible data structures that allow you to work with structured data more easily. While NumPy is primarily designed for numerical computing, Pandas adds support for labeled axes (rows and columns), making it easier to handle and analyze data in tabular form (e.g., spreadsheets or databases).

Key differences:

  • Data Structures: NumPy primarily uses arrays, which are homogeneous (i.e., all elements must be of the same type), whereas Pandas offers two main data structures: Series (1D) and DataFrame (2D), both of which can handle heterogeneous data types (e.g., integers, floats, strings, dates).
  • Labeling: Pandas allows you to label rows and columns with meaningful identifiers (e.g., strings), making the data more intuitive. NumPy, on the other hand, uses integer-based indexing for array elements.
  • Functionality: While NumPy is mainly used for numerical computations and scientific computing, Pandas provides high-level functionality for data manipulation, including filtering, grouping, joining, reshaping, and time series analysis.

In short, while both are built on the same core principles, Pandas extends NumPy’s capabilities by adding more high-level data manipulation features suitable for practical data analysis.

2. What is a DataFrame in Pandas?

A DataFrame is a 2-dimensional labeled data structure in Pandas, similar to a table or a spreadsheet, with rows and columns. It is one of the core data structures in Pandas and is used for storing and manipulating structured data.

A DataFrame can hold data of different types (e.g., integers, floats, strings) in each column, and each row is labeled with an index (often defaulting to integers). The columns are also labeled with headers. A DataFrame can be created from various data sources such as lists, dictionaries, NumPy arrays, and external data files like CSV or Excel.

Key features of a DataFrame:

  • Labeled axes (rows and columns)
  • Flexible and powerful indexing (e.g., hierarchical indexing)
  • Supports various operations like sorting, filtering, grouping, merging, and reshaping
  • Easily handles missing data
  • Optimized for high-performance computation

Example of creating a simple DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

3. What are the key data structures in Pandas?

Pandas provides two main data structures:

  1. Series:
    • A Series is a 1-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.).
    • It has a single axis (an index) and is similar to a list or a 1D NumPy array but with labeled indices.
    • A Series is created from lists, NumPy arrays, or dictionaries.

Example:

import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

  2. DataFrame:
    • A DataFrame is a 2-dimensional labeled data structure, with rows and columns, similar to a table or a spreadsheet.
    • A DataFrame is composed of multiple Series sharing the same index.
    • It can hold heterogeneous data types across columns (e.g., strings, integers, floats).

Example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

In addition to these, Pandas also supports other specialized data structures like Categorical (for handling categorical data efficiently) and DatetimeIndex (for time-based data).

4. What is the difference between a DataFrame and a Series in Pandas?

  • Series:
    • A Series is essentially a 1-dimensional array-like object, with an index.
    • It can be thought of as a single column in a DataFrame.
    • A Series can hold a variety of data types (integers, floats, strings, etc.), but it has a single axis (an index).

Example of a Series:

import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

  • DataFrame:
    • A DataFrame is a 2-dimensional table, composed of rows and columns, with both row and column labels (indices).
    • A DataFrame can hold multiple Series, where each Series represents a column.
    • It can hold heterogeneous data types across different columns (e.g., a column with integers and another with strings).

Example of a DataFrame:

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

So, the primary difference is dimensionality: a Series is one-dimensional (single column or row), while a DataFrame is two-dimensional (multiple rows and columns).

5. How do you create a Pandas DataFrame?

A Pandas DataFrame can be created in several ways. Some common methods include:

From a dictionary of lists or arrays:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

From a list of lists or tuples:

data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

From a NumPy array:

import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

From a CSV file:

df = pd.read_csv('data.csv')

From an existing Series:

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df = pd.DataFrame(s)

These methods offer flexibility in how you structure and import data into a DataFrame.

6. How do you create a Series in Pandas?

A Pandas Series can be created in several ways:

From a list or array:

import pandas as pd
s = pd.Series([1, 2, 3, 4])

From a dictionary:

s = pd.Series({'a': 1, 'b': 2, 'c': 3})

From a NumPy array:

import numpy as np
arr = np.array([1, 2, 3])
s = pd.Series(arr)

With a custom index:

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

The index of a Series is optional, but it can be set explicitly to assign meaningful labels to the data points.

7. What is the purpose of pd.read_csv() function?

The pd.read_csv() function is used to load data from a CSV (Comma Separated Values) file into a Pandas DataFrame. It is one of the most commonly used functions for reading tabular data into Pandas. This function automatically handles parsing and data type inference, making it easy to work with data from CSV files.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

You can also pass parameters to control how the CSV is read, such as specifying the delimiter (sep), handling missing values (na_values), and parsing dates (parse_dates).
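
For instance, a sketch combining these parameters (the file name, separator, and 'Date' column are hypothetical):

df = pd.read_csv(
    'data.csv',           # hypothetical file
    sep=';',              # use ';' as the field delimiter instead of ','
    na_values=['NA'],     # treat the string 'NA' as missing
    parse_dates=['Date']  # parse the 'Date' column into datetime64
)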

8. How can you read an Excel file in Pandas?

To read data from an Excel file, you can use the pd.read_excel() function. This function allows you to load data from Excel files into a Pandas DataFrame. It supports both .xls and .xlsx file formats.

Example:

import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

You can also specify additional parameters like usecols (to specify which columns to load), skiprows (to skip rows at the start), and dtype (to specify data types for columns).
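
For example, a sketch combining these parameters (the sheet and column names are hypothetical):

df = pd.read_excel(
    'data.xlsx',
    sheet_name='Sheet1',
    usecols=['Name', 'Age'],  # load only these columns
    skiprows=1,               # skip the first row of the sheet
    dtype={'Age': 'int64'}    # force 'Age' to int64
)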

9. How do you check the first few rows of a DataFrame?

To check the first few rows of a DataFrame, you can use the head() method. By default, it returns the first 5 rows, but you can specify the number of rows to return by passing an argument.

Example:

df.head()  # First 5 rows
df.head(10)  # First 10 rows

This method is useful for quickly inspecting the content of a DataFrame to verify its structure and data.

10. What is the function used to get basic statistics of a DataFrame?

To get basic statistical details of a DataFrame, such as mean, median, standard deviation, etc., you can use the describe() method. This function returns a summary of statistics for each numeric column, including count, mean, standard deviation, min, max, and quantiles.

Example:

df.describe()

For categorical columns, you can use df['column'].value_counts() to see the frequency distribution of each category.
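
For instance, assuming the DataFrame has a categorical 'City' column:

df['City'].value_counts()  # Frequency of each distinct city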

11. How do you select a single column from a DataFrame?

In Pandas, selecting a single column from a DataFrame can be done using the column name either as a string key or as an attribute. Here are two common ways to select a single column:

Using the column name as a key:

df['column_name']

Using attribute-style access (this method works only if the column name is a valid Python identifier and does not conflict with DataFrame methods):

df.column_name

Both methods return a Pandas Series object, which is a 1D array of values from the selected column.

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
age_column = df['Age']

12. How can you select multiple columns from a DataFrame?

To select multiple columns from a DataFrame, you need to pass a list of column names inside square brackets. This returns a DataFrame consisting of only the selected columns.

Example:

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
selected_columns = df[['Name', 'Age']]

In this example, selected_columns will be a DataFrame with the Name and Age columns.

13. How can you filter rows in a DataFrame based on a condition?

You can filter rows in a DataFrame based on a condition by using boolean indexing. This involves applying a condition to one or more columns, which returns a boolean Series that you can use to filter the rows.

Example:

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]

In this example, filtered_df will contain only Charlie's row, since he is the only person with an age greater than 30.

You can also filter using multiple conditions:

filtered_df = df[(df['Age'] > 25) & (df['Name'] == 'Bob')]

Here, the & operator is used to combine multiple conditions, and you must wrap each condition in parentheses.

14. How do you get the shape of a DataFrame?

The shape of a DataFrame can be obtained using the .shape attribute, which returns a tuple representing the number of rows and columns in the DataFrame.

Example:

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

shape = df.shape
print(shape)  # Output: (3, 2) -- 3 rows and 2 columns

Here, df.shape returns a tuple (3, 2), where 3 is the number of rows and 2 is the number of columns.

15. What is the use of info() method in Pandas?

The info() method in Pandas provides a concise summary of a DataFrame. It gives the following useful information:

  • The number of non-null entries in each column
  • The datatype of each column (e.g., int64, float64, object)
  • The memory usage of the DataFrame

Example:

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 111.0+ bytes

This is especially useful for quickly understanding the structure of your data, identifying missing values, and checking column types.

16. How do you handle missing data in Pandas?

Handling missing data is a common task when working with real-world data. Pandas provides several ways to handle missing values, which are represented as NaN (Not a Number).

Identifying missing data: You can use the .isna() or .isnull() methods to identify missing values. Both return a DataFrame of the same shape, with True for missing values and False for non-missing values. Example:

df.isna()

Dropping missing values: You can remove rows or columns with missing values using the .dropna() method. Example:

df.dropna()  # Drops any rows with NaN values
df.dropna(axis=1)  # Drops any columns with NaN values

Filling missing values: The .fillna() method allows you to fill missing values with a specific value, such as the mean or median, or with forward or backward filling. Example:

df.fillna(0)  # Replace every NaN in the DataFrame with 0
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Replace NaN in 'Age' with that column's mean

17. What does NaN represent in a DataFrame?

NaN stands for Not a Number, and it represents missing or undefined data in a Pandas DataFrame. It is the standard missing value marker in Pandas and is part of the IEEE floating point standard.

NaN is typically used in DataFrames when:

  • A value is missing, null, or undefined
  • A calculation cannot be performed (e.g., division by zero)
  • Data from external sources is incomplete

You can check for NaN values using the .isna() or .isnull() method.

Example:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35]
})
df.isna()  # Check for NaN values

Output:

  Name    Age
0  False  False
1  False   True
2  False  False

18. How do you fill missing values in a DataFrame?

Pandas provides several strategies for filling missing data using the .fillna() method. You can fill missing values with a constant value, or with calculated values like the mean, median, or mode of a column. You can also use forward fill (ffill) or backward fill (bfill), where missing values are replaced with the previous or next valid value, respectively.

Filling with a constant:

df.fillna(0)  # Replace all NaN values with 0

Filling with the mean of a column:

df['Age'] = df['Age'].fillna(df['Age'].mean())

Forward fill (propagate previous valid value):

df.ffill()  # older form: df.fillna(method='ffill')

Backward fill (propagate next valid value):

df.bfill()  # older form: df.fillna(method='bfill')

Note: the method= parameter of fillna() is deprecated in recent Pandas releases, so the dedicated ffill() and bfill() methods are preferred.

19. How can you drop missing values in a DataFrame?

To drop missing values in a Pandas DataFrame, you can use the .dropna() method. This will remove any rows or columns that contain NaN values.

Drop rows with missing values:

df.dropna()  # Drops any rows with NaN values

Drop columns with missing values:

df.dropna(axis=1)  # Drops any columns with NaN values

You can also specify a threshold: for example, keep only rows that have at least 2 non-null values:

df.dropna(thresh=2)  # Keep rows with at least 2 non-null values

20. What is the purpose of dropna() method in Pandas?

The dropna() method is used to remove missing values from a DataFrame by deleting either rows or columns that contain NaN values. This is useful when you want to clean your data by removing incomplete or missing entries.

You can specify whether to drop rows or columns, and also define a threshold for the number of non-null values required to keep a row or column.

Key parameters of dropna():

  • axis: 0 to drop rows (default), 1 to drop columns.
  • how: 'any' (default) to drop rows or columns with any NaN values, 'all' to drop rows or columns only if all values are NaN.
  • thresh: Require a minimum number of non-null values to keep the row or column.

Example:

df.dropna(axis=0, how='any')  # Drop rows with any NaN values
df.dropna(axis=1, how='all')  # Drop columns with all NaN values

21. How do you sort a DataFrame by one or more columns?

To sort a DataFrame in Pandas, you can use the .sort_values() method, which allows you to specify one or more columns by which to sort the data.

Sorting by a single column:

df.sort_values(by='column_name')

By default, .sort_values() sorts in ascending order. To sort in descending order, you can set the ascending parameter to False:

df.sort_values(by='column_name', ascending=False)

Sorting by multiple columns:

To sort by multiple columns, pass a list of column names to the by parameter:

df.sort_values(by=['column1', 'column2'])

You can also specify the sorting order for each column individually by passing a list to the ascending parameter:

df.sort_values(by=['column1', 'column2'], ascending=[True, False])

Example

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 25],
    'Score': [85, 90, 88, 95]
})
df.sort_values(by=['Age', 'Score'], ascending=[True, False])

This will first sort by Age in ascending order and then by Score in descending order.

22. What does the iloc[] function do in Pandas?

The iloc[] function in Pandas is used for integer-location based indexing. It allows you to select rows and columns from a DataFrame using integer indices (i.e., numerical positions).

Syntax:

df.iloc[row_index, column_index]

  • row_index: An integer (or list of integers) specifying the row(s) to select.
  • column_index: An integer (or list of integers) specifying the column(s) to select.

Examples:

Select the first row and first column:

df.iloc[0, 0]

Select the first 3 rows and all columns:

df.iloc[:3, :]

Select all rows and the first two columns:

df.iloc[:, :2]

Select a specific set of rows and columns (e.g., rows 1 and 2, columns 0 and 2):

df.iloc[[1, 2], [0, 2]]

23. What does the loc[] function do in Pandas?

The loc[] function in Pandas is used for label-based indexing. It allows you to select rows and columns by their label names (i.e., row and column names).

Syntax:

df.loc[row_label, column_label]

  • row_label: The label or index value of the row(s) to select.
  • column_label: The label of the column(s) to select.

Examples:

Select a specific row by label:

df.loc[0]  # Select the row with index label 0

Select a specific value in a row and column:

df.loc[0, 'Age']  # Select the 'Age' value of the row with index 0

Select multiple rows by label:

df.loc[0:2]  # Select rows with labels 0, 1, and 2

Select multiple rows and columns by label:

df.loc[0:2, ['Name', 'Score']]  # Select rows 0 to 2 and columns 'Name' and 'Score'

24. How can you rename columns in a DataFrame?

To rename columns in a DataFrame, use the rename() method. This method allows you to specify new column names by passing a dictionary of old column names to new column names.

df.rename(columns={'old_name': 'new_name'})

Example

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

df.rename(columns={'Name': 'Full Name', 'Age': 'Age Group'}, inplace=True)

The inplace=True argument modifies the original DataFrame; otherwise, it returns a new DataFrame with the updated column names.

25. How do you reset the index of a DataFrame?

To reset the index of a DataFrame, you can use the reset_index() method. This will move the current index to a new column and replace it with the default integer-based index.

Syntax:

df.reset_index(drop=False)

  • drop=False: The current index is added as a column.
  • drop=True: The current index is discarded without adding it as a column.

Example:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
df = df.set_index('Name')  # Setting 'Name' as the index
df.reset_index(drop=True, inplace=True)  # Reset index and drop old index column

26. What is the purpose of groupby() function in Pandas?

The groupby() function in Pandas is used to group a DataFrame based on one or more columns and then perform operations like aggregation, transformation, or filtering on each group.

Steps involved in groupby():

  1. Split the data into groups based on the values of one or more columns.
  2. Apply a function (e.g., aggregation or transformation) to each group.
  3. Combine the results into a new DataFrame.

Common operations:

  • Aggregation (e.g., sum, mean, count)
  • Transformation (e.g., normalization, filling)
  • Filtration (e.g., removing groups based on condition)

Example:

df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B'],
    'Points': [10, 20, 15, 25]
})

grouped = df.groupby('Team').sum()  # Group by 'Team' and sum 'Points'

This will group the rows by the 'Team' column and sum the 'Points' for each group.
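
Transformation and filtration on the same grouped data might look like this (a sketch using the df defined above):

# Broadcast each team's mean back onto the original rows
df['Team_avg'] = df.groupby('Team')['Points'].transform('mean')

# Keep only rows belonging to teams whose total points exceed 30
high_scoring = df.groupby('Team').filter(lambda g: g['Points'].sum() > 30)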

27. How do you filter DataFrame columns based on a condition?

You can filter columns in a DataFrame by applying a condition to the column values. This returns a new DataFrame with only the columns that satisfy the condition.

Example:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 90, 88]
})

filtered_df = df[df['Age'] > 30]  # Filter rows where Age is greater than 30

You can also filter columns (not rows) based on conditions, for example keeping only the numeric columns whose mean exceeds 80:

numeric = df.select_dtypes(include='number')
filtered_columns = numeric.loc[:, numeric.mean() > 80]  # Keep numeric columns with mean > 80

28. How do you combine two DataFrames vertically (append)?

To combine two DataFrames vertically (i.e., stacking them row-wise), use the concat() function. Older code may use the append() method, but append() was deprecated in Pandas 1.4 and removed in Pandas 2.0, so concat() is the recommended approach.

Example using append() (legacy, Pandas < 2.0):

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

df2 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
})

combined = df1.append(df2, ignore_index=True)

Example using concat():

combined = pd.concat([df1, df2], ignore_index=True)

The ignore_index=True argument ensures the index is reset after appending.

29. How can you merge two DataFrames based on a common column?

To merge two DataFrames based on a common column, you can use the merge() function, which is similar to SQL join operations (e.g., inner join, left join).

Syntax:

df1.merge(df2, on='common_column', how='inner')

  • on='common_column': Specifies the column to join on.
  • how='inner': The type of join. Possible values are 'inner', 'outer', 'left', 'right'.

Example:

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Age': [30, 35, 40]
})

merged_df = df1.merge(df2, on='ID', how='inner')

This will merge the DataFrames based on the ID column, keeping only rows with matching IDs in both DataFrames (inner join).

30. What is the difference between merge() and concat() functions in Pandas?

The main difference between merge() and concat() in Pandas lies in how they combine DataFrames:

  • merge() is used for combining DataFrames based on common columns (like SQL joins). You can specify the type of join ('inner', 'outer', 'left', 'right') and the key(s) for merging.
  • concat() is used to concatenate DataFrames along a particular axis (rows or columns), typically for stacking DataFrames either vertically (adding rows) or horizontally (adding columns).

Example using concat():

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df_concat = pd.concat([df1, df2], axis=1)

In summary:

  • Use merge() for SQL-like joins.
  • Use concat() for appending rows or columns directly.

31. How can you check for duplicate values in a DataFrame?

To check for duplicate rows in a DataFrame, you can use the duplicated() method. This method returns a boolean Series, where True indicates a duplicate row and False indicates a unique row. By default, it checks for duplicates in all columns.

Syntax:

df.duplicated()

You can also check for duplicates in specific columns by passing the column names to the subset parameter.

Example:

import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35]
})

duplicates = df.duplicated()
print(duplicates)

Output:

0    False
1    False
2     True
3    False
dtype: bool

In this case, the row with index 2 is a duplicate of the row with index 0.
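
To check duplicates on a subset of columns, for example 'Name' only:

duplicates_by_name = df.duplicated(subset=['Name'])  # True where 'Name' has appeared before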

32. How do you remove duplicate rows in a DataFrame?

To remove duplicate rows, you can use the drop_duplicates() method. By default, this method removes all rows that are duplicates, keeping the first occurrence.

Syntax:

df.drop_duplicates()

You can specify the columns on which to check for duplicates using the subset parameter and whether to keep the first, last, or none using the keep parameter.

Example:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35]
})

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

You can also keep the last occurrence of each duplicate:

df_no_duplicates_last = df.drop_duplicates(keep='last')

Or remove all duplicates:

df_no_duplicates_none = df.drop_duplicates(keep=False)

33. How do you apply a function to a DataFrame column?

You can apply a function to a DataFrame column using the apply() method. This allows you to apply a function to each element of the column.

Syntax:

df['column_name'].apply(function)

Example

import pandas as pd
df = pd.DataFrame({
    'Age': [25, 30, 35]
})

# Apply a function to square each element in the 'Age' column
df['Age_squared'] = df['Age'].apply(lambda x: x ** 2)
print(df)

Output

  Age  Age_squared
0   25          625
1   30          900
2   35         1225

The apply() method is very flexible and can be used to apply any custom function to the column.

34. What is the purpose of apply() function in Pandas?

The apply() function in Pandas is used to apply a function along an axis (either rows or columns) of a DataFrame or Series. It is a powerful tool for performing element-wise operations or aggregations, especially when the operation is complex and cannot be done directly with vectorized operations.

You can apply a function to:

  • A single column or row (Series)
  • The entire DataFrame (row-wise or column-wise)

Example (applying to rows):

df.apply(lambda row: row['Age'] * 2, axis=1)

Example (applying to columns):

df.apply(lambda col: col.max() - col.min())

In the first example, axis=1 indicates that the function is applied row-wise, while in the second example, axis=0 (default) applies the function column-wise.

35. How do you change the datatype of a DataFrame column?

You can change the datatype of a DataFrame column using the astype() method. This method allows you to specify the desired datatype for one or more columns.

Syntax:

df['column_name'] = df['column_name'].astype(new_type)

Example:

df = pd.DataFrame({
    'Age': ['25', '30', '35']
})

# Change the 'Age' column from string to integer
df['Age'] = df['Age'].astype(int)
print(df)

Output:

  Age
0   25
1   30
2   35

You can also convert multiple columns at once by passing a dictionary to astype():

Example

df = df.astype({'Age': 'int64'})

36. What is the purpose of astype() method in Pandas?

The astype() method in Pandas is used to cast a column (or an entire DataFrame) to a specified data type. This is particularly useful when the data types of columns are not as expected (e.g., numeric values stored as strings).

Example:

df['Age'] = df['Age'].astype(float)

You can use astype() to convert columns to other numeric types (e.g., int, float) or to categorical types ('category'), datetime ('datetime64'), etc.
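
For instance (a sketch with hypothetical columns):

df['Color'] = df['Color'].astype('category')  # memory-efficient categorical type
df['Age'] = df['Age'].astype('int32')         # downcast to a smaller integer type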

37. How do you perform basic arithmetic operations on DataFrame columns?

Pandas allows you to perform arithmetic operations on DataFrame columns directly. You can add, subtract, multiply, or divide columns in a straightforward manner.

Example:

df = pd.DataFrame({
    'Age': [25, 30, 35],
    'Score': [85, 90, 88]
})

# Add two columns
df['Total'] = df['Age'] + df['Score']

# Subtract two columns
df['Age_minus_Score'] = df['Age'] - df['Score']

# Multiply two columns
df['Age_times_Score'] = df['Age'] * df['Score']

# Divide two columns
df['Age_divided_by_Score'] = df['Age'] / df['Score']

These operations can also be performed with constants:

df['Age_plus_5'] = df['Age'] + 5

Pandas handles element-wise operations, so the operation is applied to each value in the columns.

38. How do you check if a DataFrame contains null values?

To check for null values (NaN) in a DataFrame, you can use the isna() or isnull() methods. These return a DataFrame of the same shape with True for each NaN value and False otherwise.

Example:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35]
})

df.isna()

This will return:

   Name    Age
0  False  False
1  False   True
2  False  False

To check if there are any null values in the entire DataFrame, use:

df.isna().any().any()  # Returns True if any NaN values are present

To count the number of null values per column:

df.isna().sum()

39. What is the difference between at[] and iat[] in Pandas?

The at[] and iat[] methods are used to access individual elements in a DataFrame, but they differ in how they work:

at[]: It is used for label-based indexing to access a single value from a DataFrame. You provide the row label and column label.

df.at[1, 'Age']  # Access the value in row 1 and column 'Age'

iat[]: It is used for integer-location based indexing. You provide the row index and column index as integers.

df.iat[1, 1]  # Access the value at row index 1 and column index 1

Both methods are faster than using loc[] or iloc[] when accessing a single value.

40. What is the use of pd.to_datetime() function?

The pd.to_datetime() function is used to convert a column or series of date-like objects (e.g., strings) into Pandas datetime objects. This is useful when dealing with date data stored as strings or other formats, as it ensures that the data is properly interpreted as datetime objects, allowing you to perform datetime-related operations.

Syntax:

pd.to_datetime(data)

Example

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-02-01', '2023-03-01']
})

df['Date'] = pd.to_datetime(df['Date'])

After conversion, the Date column will have the datetime64 type, enabling datetime operations like comparison, extraction of day/month/year, and more.

You can also specify the format of the date string for better performance:

df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

Intermediate Questions with Answers

1. How can you filter rows of a DataFrame based on multiple conditions?

You can filter rows of a DataFrame on multiple conditions by combining boolean expressions with logical operators (& for AND, | for OR). Each condition must be enclosed in parentheses to ensure correct precedence. For example:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 88, 95]
})

# Filter rows where Age is greater than 30 and Score is greater than 90
filtered_df = df[(df['Age'] > 30) & (df['Score'] > 90)]
print(filtered_df)

This will output:

   Name  Age  Score
3  David   40     95

In the example above:

  • df['Age'] > 30 filters rows where Age is greater than 30.
  • df['Score'] > 90 filters rows where Score is greater than 90.
  • The & operator combines the two conditions, selecting rows that meet both conditions.

Note: Use & for AND conditions and | for OR conditions. Always wrap each condition in parentheses, since & and | bind more tightly than comparison operators.
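
As an alternative, the same filter can be written with DataFrame.query(), which some find more readable:

filtered_df = df.query('Age > 30 and Score > 90')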

2. Explain the difference between merge() and join() in Pandas.

Both merge() and join() are used to combine two DataFrames, but they have different use cases and syntax:

  • merge(): This is a more flexible function used to perform SQL-like joins (inner, left, right, outer) based on columns or indices. It can join DataFrames using one or more columns as keys.

Syntax:

df1.merge(df2, on='key', how='inner')

Example:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})

merged_df = df1.merge(df2, on='ID', how='inner')

  • join(): This is a more straightforward method used to combine DataFrames based on their index or a column. By default, join() works on indices but can be used with columns as keys using the on parameter.

Syntax:

df1.join(df2, on='key', how='left')

Example

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'Age': [25, 30, 35]}, index=[1, 2, 3])

joined_df = df1.join(df2)

Key Differences:

  • merge() is more versatile and flexible, capable of performing a variety of joins based on columns and indices.
  • join() is simpler, commonly used when you want to join DataFrames based on their indices, but it can also handle column joins with the on parameter.

3. How do you change the index of a DataFrame?

You can change the index of a DataFrame using the set_index() method. This method allows you to set one or more columns as the new index of the DataFrame.

Syntax:

df.set_index('column_name', inplace=True)

Example

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Set 'Name' as the index
df.set_index('Name', inplace=True)
print(df)

Output:

        Age
Name        
Alice     25
Bob       30
Charlie   35

You can also reset the index to the default integer-based index using reset_index():

df.reset_index(inplace=True)

4. What is the purpose of pivot_table() in Pandas?

The pivot_table() function in Pandas is used to create a pivot table from a DataFrame. It allows you to aggregate data and reshape the data, summarizing it based on one or more grouping columns.

Syntax:

df.pivot_table(values='column_name', index='group_column', aggfunc='mean')

  • values: The column(s) to aggregate.
  • index: The column(s) to group by (rows of the pivot table).
  • aggfunc: The aggregation function (e.g., 'mean', 'sum', 'count', etc.).

Example:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
    'Age': [25, 30, 35, 40, 25],
    'Score': [85, 90, 88, 95, 80]
})

pivot = df.pivot_table(values='Score', index='Name', aggfunc='mean')
print(pivot)

Output:

        Score
Name           
Alice     82.5
Bob       90.0
Charlie   88.0
David     95.0

In this example, the pivot_table() calculates the mean score for each name, grouping the data by the 'Name' column.

5. How do you perform a groupby operation and then aggregate the data?

You can use the groupby() function followed by an aggregation function like sum(), mean(), or count() to group data and compute aggregate statistics.

Syntax:

df.groupby('group_column').agg(aggregation_function)

Example

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 25, 30, 35],
    'Score': [85, 90, 88, 92, 88]
})

grouped = df.groupby('Name').agg({'Age': 'mean', 'Score': 'mean'})
print(grouped)

Output

        Age  Score
Name                 
Alice   25.0   86.5
Bob     30.0   91.0
Charlie 35.0   88.0

In this example, groupby('Name') groups the data by the 'Name' column, and agg({'Age': 'mean', 'Score': 'mean'}) computes the mean of 'Age' and 'Score' for each group.

6. What is a MultiIndex in Pandas? How do you work with it?

A MultiIndex in Pandas allows you to work with multiple levels of indexing in a DataFrame. This is useful for representing hierarchical data, where each index can have multiple levels.

Creating a MultiIndex:

You can create a MultiIndex by passing a list of tuples or using the set_index() method with multiple columns.

Example:

df = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'City': ['New York', 'Boston', 'Chicago', 'Miami'],
    'Sales': [100, 200, 300, 400]
})

df.set_index(['Region', 'City'], inplace=True)
print(df)

Output

              Sales
Region City          
North New York    100
      Boston      200
South Chicago     300
      Miami       400

In this example, 'Region' and 'City' form a MultiIndex.

Accessing data with MultiIndex:

You can access data in a MultiIndex using .loc[]:

df.loc[('North', 'New York')]

You can also reset the MultiIndex using reset_index():

df.reset_index(inplace=True)

7. How do you concatenate DataFrames along columns and rows?

You can concatenate DataFrames along rows or columns using the concat() function.

Concatenating along rows (vertically):

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)

Output

     Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40

Concatenating along columns (horizontally):

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Score': [85, 90]})

result = pd.concat([df1, df2], axis=1)
print(result)

Output

     Name  Age  Score
0    Alice   25     85
1      Bob   30     90

8. Explain the difference between concat() and append() in Pandas.

concat(): It is used for concatenating multiple DataFrames along a particular axis (either rows or columns). It is more flexible and can concatenate more than two DataFrames at once. You can also control how the index is handled with the ignore_index parameter.Example:

pd.concat([df1, df2], axis=0)

append(): It was used to append one DataFrame to the end of another, row-wise (axis=0). append() was essentially a shorthand for concat(), but less flexible, appending only one DataFrame at a time. Note that append() was deprecated in Pandas 1.4 and removed in Pandas 2.0. Example (legacy, Pandas < 2.0):

df1.append(df2, ignore_index=True)

Key Difference: concat() is more versatile and can concatenate multiple DataFrames at once, while append() was a simpler shorthand for appending a single DataFrame that has since been removed in favor of concat().

9. How do you handle categorical data in Pandas?

You can handle categorical data in Pandas by using the Categorical data type. This data type is useful when you have a column with a limited number of possible values (categories), as it is more memory efficient and allows for more optimized operations.

Example:

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']
})

# Convert 'Color' column to categorical type
df['Color'] = pd.Categorical(df['Color'])
print(df)

Pandas also supports CategoricalDtype, and you can use category to manage categorical data with methods like cat.codes to map categories to numerical codes.
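
For example, mapping the categories of the 'Color' column above to integer codes:

df['Color_code'] = df['Color'].cat.codes  # Blue=0, Green=1, Red=2 (categories sort alphabetically by default)
print(df['Color'].cat.categories)         # Index(['Blue', 'Green', 'Red'], dtype='object')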

10. What is the cut() function used for in Pandas?

The cut() function is used to segment and sort data values into discrete bins or intervals. This is useful for converting continuous data into categorical data by grouping values into bins.

Syntax

pd.cut(data, bins)

Example

df = pd.DataFrame({'Age': [22, 25, 30, 35, 40, 45]})

# Define bins for age groups
bins = [20, 30, 40, 50]

# Cut the data into bins
df['Age_Group'] = pd.cut(df['Age'], bins)
print(df)

Output

   Age Age_Group
0   22  (20, 30]
1   25  (20, 30]
2   30  (20, 30]
3   35  (30, 40]
4   40  (30, 40]
5   45  (40, 50]

The cut() function divides the age data into three bins: (20, 30], (30, 40], and (40, 50]. You can also customize the bin labels and whether to include right or left edges in the intervals.
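
For example, custom labels and left-inclusive intervals (the label names here are arbitrary):

df['Age_Group'] = pd.cut(df['Age'], bins, labels=['20s', '30s', '40s'])  # attach a label to each bin
df['Age_Group_left'] = pd.cut(df['Age'], bins, right=False)              # [20, 30)-style intervals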

11. How can you sample random rows from a DataFrame?

You can sample random rows from a DataFrame using the sample() method. The sample() method allows you to randomly select a specified number of rows (or a fraction of the data) from the DataFrame.

Syntax:

df.sample(n=number_of_rows)

Where n is the number of rows you want to sample.

You can also specify the fraction of rows to sample using the frac parameter.

Example:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45]
})

# Sample 2 random rows from the DataFrame
sampled_df = df.sample(n=2)
print(sampled_df)

To sample a fraction of the rows (e.g., 50%)

sampled_df = df.sample(frac=0.5)

If you want reproducible results (i.e., the same random rows on every run), you can set the random seed:
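
sampled_df = df.sample(n=2, random_state=42)  # returns the same 2 rows on every run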

12. How do you perform string operations in Pandas?

Pandas provides the str accessor to perform vectorized string operations on a column of text data. The str methods are very similar to Python's built-in string methods but are optimized for working with Series of strings.

Example:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Convert all names to uppercase
df['Name_upper'] = df['Name'].str.upper()

# Check if a name contains the letter 'a'
df['Contains_a'] = df['Name'].str.contains('a')

print(df)

Output

      Name Name_upper  Contains_a
0    Alice      ALICE       False
1      Bob        BOB       False
2  Charlie    CHARLIE        True
3    David      DAVID        True

Note that str.contains('a') is case-sensitive by default, so 'Alice' (capital 'A') is False.

Some other commonly used str methods include:

  • str.lower(): Converts strings to lowercase.
  • str.len(): Returns the length of each string.
  • str.replace(): Replaces a substring within the string.
  • str.startswith(): Checks if the string starts with a specific prefix.
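
A quick sketch of the methods above on the same 'Name' column:

df['Name'].str.lower()              # 'alice', 'bob', 'charlie', 'david'
df['Name'].str.len()                # 5, 3, 7, 5
df['Name'].str.replace('ar', 'AR')  # 'Charlie' becomes 'ChARlie'; the others are unchanged
df['Name'].str.startswith('A')      # True, False, False, False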

13. What is the purpose of str.contains() method in Pandas?

The str.contains() method is used to check if a substring or regular expression pattern exists within each string of a Series. It returns a boolean Series indicating whether the substring is present in each string.

Syntax:

df['column_name'].str.contains('substring', na=False)

  • na=False: This parameter ensures that NaN values are treated as False when checking for the substring.

Example:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Check if the name contains the letter 'a'
df['Has_a'] = df['Name'].str.contains('a')
print(df)

Output

      Name  Has_a
0    Alice  False
1      Bob  False
2  Charlie   True
3    David   True

In this example, str.contains('a') checks each name for the presence of the lowercase letter 'a'. The match is case-sensitive by default (hence False for 'Alice'); pass case=False to ignore case.

Note: You can also use regular expressions in the str.contains() method for more advanced matching.
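
For instance, a regular expression matching names that start with 'A' or 'B':

df['Name'].str.contains(r'^[AB]', regex=True)  # True, True, False, False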

14. How do you filter rows based on the presence of a substring in a column?

You can filter rows based on the presence of a substring by using str.contains() along with boolean indexing. This allows you to create a condition where you only select rows that contain the substring in the specified column.

Example:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
})

# Filter rows where the 'Name' contains the substring 'a'
filtered_df = df[df['Name'].str.contains('a', case=False)]
print(filtered_df)

Output

     Name  Age
0    Alice   25
2  Charlie   35
3    David   40

In this example, the str.contains('a', case=False) method filters rows where the 'Name' column contains the letter 'a' (case-insensitive).

You can also apply regular expressions or use the na parameter for missing values:

df[df['Name'].str.contains('a', na=False)]

15. How do you normalize a DataFrame (e.g., scale numeric data)?

Normalization or scaling is a technique to adjust the values in a numeric column so they fall within a specific range (commonly [0, 1]). You can normalize data using MinMaxScaler from sklearn.preprocessing, or manually using pandas operations.

Using MinMaxScaler from sklearn:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])

Manual normalization:

To scale a column manually to the range [0, 1], you can use the following formula:

normalized = (x − min(x)) / (max(x) − min(x))

Example:

import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45]
})

# Normalize the 'Age' column
df['Age_normalized'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
print(df)

Output

   Age  Age_normalized
0   25            0.00
1   30            0.25
2   35            0.50
3   40            0.75
4   45            1.00

16. What is the difference between pivot() and melt() functions in Pandas?

pivot(): The pivot() function is used to reshape data by turning unique values from one column into separate columns. It is typically used when you want to "spread" a column's values into multiple columns.Example of pivot():

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'City': ['New York', 'New York', 'Boston', 'Boston'],
    'Temperature': [30, 32, 25, 28]
})

pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)

Output

City          Boston  New York
Date                          
2023-01-01     25        30
2023-01-02     28        32

melt(): The melt() function is the inverse of pivot(). It "unpacks" a DataFrame and converts it from a wide format to a long format by gathering columns into a single column.Example of melt():

df_melted = pivot_df.reset_index().melt(id_vars='Date', value_vars=['Boston', 'New York'], var_name='City', value_name='Temperature')
print(df_melted)

Output

       Date     City  Temperature
0  2023-01-01   Boston           25
1  2023-01-02   Boston           28
2  2023-01-01  New York           30
3  2023-01-02  New York           32

17. How can you perform time-series analysis in Pandas?

Pandas provides various tools for working with time-series data. The to_datetime() function allows you to convert strings or other data types into datetime objects, enabling time-series functionality such as resampling, shifting, and rolling window operations.

Example of time-series analysis:

import pandas as pd

# Create a DataFrame with a date range
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'Sales': [100, 150, 200, 250, 300]
})

df.set_index('Date', inplace=True)
print(df)

Output

           Sales
Date             
2023-01-01    100
2023-01-02    150
2023-01-03    200
2023-01-04    250
2023-01-05    300

You can now perform time-based indexing, resampling, or apply rolling operations.
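
For example, a few such operations on this DataFrame:

df.loc['2023-01-02':'2023-01-04']     # slice rows by date labels
df['Sales'].rolling(window=2).mean()  # 2-day rolling average
df['Sales'].shift(1)                  # lag sales by one day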

18. How do you resample time-series data?

Resampling allows you to change the frequency of time-series data. You can use resample() to upsample or downsample time-series data based on a specified time frequency (e.g., daily, monthly).

Example:

df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'Sales': [100, 150, 200, 250, 300]
})

df.set_index('Date', inplace=True)

# Resample data to monthly frequency, using the sum of sales
monthly_sales = df.resample('M').sum()
print(monthly_sales)

Output:

          Sales
Date             
2023-01-31    1000

19. How do you convert a DataFrame column to datetime format?

You can convert a column to datetime format using pd.to_datetime(). This function automatically detects the format of the date strings and converts them into datetime objects.

Example:

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-02-01', '2023-03-01']
})

df['Date'] = pd.to_datetime(df['Date'])
print(df)

Output

       Date
0 2023-01-01
1 2023-02-01
2 2023-03-01

20. How do you fill missing values using forward fill or backward fill in Pandas?

Pandas provides the fillna() method to fill missing values. The method='ffill' (forward fill) and method='bfill' (backward fill) parameters propagate the last valid value forward or the next valid value backward; in recent Pandas releases the method= parameter is deprecated, and the dedicated ffill() and bfill() methods are preferred.

Example:

df = pd.DataFrame({
    'Value': [10, None, 20, None, 30]
})

# Forward fill missing values
df['Value_ffill'] = df['Value'].ffill()  # older form: fillna(method='ffill')

# Backward fill missing values
df['Value_bfill'] = df['Value'].bfill()  # older form: fillna(method='bfill')

print(df)

Output

   Value  Value_ffill  Value_bfill
0   10.0         10.0         10.0
1    NaN         10.0         20.0
2   20.0         20.0         20.0
3    NaN         20.0         30.0
4   30.0         30.0         30.0

21. How do you drop rows with a specific value in a column?

To drop rows where a column contains a specific value, you can use boolean indexing combined with the drop() method or loc[].

Example:

Suppose you have a DataFrame where you want to remove rows where the 'Age' column has a value of 30.

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 30]
})

# Drop rows where 'Age' is 30
df = df[df['Age'] != 30]
print(df)

Output

     Name  Age
0    Alice   25
2  Charlie   35

Alternatively, you can use the drop() method with a condition:

df = df.drop(df[df['Age'] == 30].index)
