The pandas library is one of the most powerful and popular tools for data analysis and manipulation in Python. It provides data structures like DataFrame and Series for handling structured data, such as tables in a database or spreadsheet.
Here are 10 Python snippets demonstrating common data analysis tasks using pandas:
1. Creating a DataFrame
Creating a DataFrame from a dictionary of lists.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Miami'],
}
df = pd.DataFrame(data)
print(df)
Explanation:
A DataFrame is created from a dictionary, where the keys are the column names and the values are lists of data.
2. Reading Data from a CSV File
Reading a CSV file into a DataFrame.
Explanation:
pd.read_csv() loads data from a CSV file into a DataFrame.
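As a self-contained illustration, the sketch below feeds pd.read_csv() an in-memory string instead of a file on disk, so it runs without any external data. The CSV text and its column names are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would come from a file path
csv_text = "Name,Age,City\nAlice,24,New York\nBob,27,Los Angeles\n"

# read_csv accepts a path or any file-like object, such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the DataFrame
print(df.dtypes)   # column types inferred while parsing
```

Wrapping the text in io.StringIO is only a convenience for the demo; passing a real file path works the same way.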
3. DataFrame Selection and Indexing
Selecting a single column or multiple columns from a DataFrame.
Explanation:
Use df['column_name'] for selecting a single column and df[['col1', 'col2']] for selecting multiple columns.
4. Filtering Data
Filtering data based on conditions.
Explanation:
You can filter a DataFrame by applying a condition on columns like df[df['Age'] > 23].
5. Handling Missing Data
Handling missing or NaN values in a DataFrame.
Explanation:
df.fillna(value) replaces NaN values with the specified value.
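Filling is one option among several. The following sketch, using a small made-up table, counts the missing values first and then shows filling next to dropping the affected rows with dropna(). Which approach fits depends on the data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing age
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, np.nan, 22]})

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: replace missing ages with a fixed value (0 is an arbitrary choice)
filled = df.fillna({'Age': 0})
print(filled)

# Option 2: drop the rows where 'Age' is missing
dropped = df.dropna(subset=['Age'])
print(dropped)
```

Both calls return new DataFrames and leave df unchanged.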
6. Grouping Data
Grouping data by one or more columns and performing aggregation.
Explanation:
df.groupby('City') groups the data by the 'City' column and allows performing aggregation functions like mean().
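More than one aggregation can be computed per group in a single call. The sketch below assumes a small made-up table and uses agg() with a list of function names.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['NY', 'LA', 'NY', 'LA'],
})

# Compute several statistics of 'Age' for each city at once
summary = df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result has one row per city and one column per aggregation.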
7. Sorting Data
Sorting a DataFrame by one or more columns.
Explanation:
df.sort_values('column_name') sorts the DataFrame by the specified column. Use ascending=False for descending order.
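Sorting by several columns works by passing lists. In this sketch, built on made-up data, rows are ordered by city A to Z and then by age from oldest to youngest within each city.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['NY', 'LA', 'NY', 'LA'],
})

# One ascending flag per sort column: City ascending, Age descending
sorted_df = df.sort_values(['City', 'Age'], ascending=[True, False])
print(sorted_df)
```

sort_values() returns a new DataFrame; the original row order in df is untouched.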
8. Applying Functions to Columns
Applying a custom function to each element of a column.
Explanation:
df['Age'].apply(func) applies a custom function to each element in the 'Age' column.
9. Merging DataFrames
Merging two DataFrames on a common column.
Explanation:
pd.merge(df1, df2, on='column_name') merges two DataFrames based on a common column. The how parameter defines the type of join: inner, outer, left, or right.
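A minimal sketch of the join types, assuming two made-up tables that share an 'ID' column. The table and column names are illustrative only.

```python
import pandas as pd

# Hypothetical tables: ID 3 appears only in people, ID 4 only in cities
people = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
cities = pd.DataFrame({'ID': [1, 2, 4], 'City': ['New York', 'Chicago', 'Miami']})

# Inner join (the default): keep only IDs present in both tables
inner = pd.merge(people, cities, on='ID')
print(inner)

# Left join: keep every row of people; unmatched cities become NaN
left = pd.merge(people, cities, on='ID', how='left')
print(left)

# Outer join: keep every ID from either table
outer = pd.merge(people, cities, on='ID', how='outer')
print(outer)
```

Comparing the row counts of the three results is a quick way to see how how changes which keys survive the merge.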
10. Pivot Table
Creating a pivot table to summarize data.
Explanation:
pd.pivot_table(df, values='column_name', index='group_column') creates a pivot table that summarizes the data, allowing for aggregation functions like mean, sum, count, etc.
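A minimal sketch of a pivot table, assuming a made-up sales table. The 'City', 'Product', and 'Sales' columns are illustrative names, not part of any real dataset.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 300],
})

# Average sales per city
by_city = pd.pivot_table(sales, values='Sales', index='City', aggfunc='mean')
print(by_city)

# Total sales per city, with one column per product
by_city_product = pd.pivot_table(
    sales, values='Sales', index='City', columns='Product', aggfunc='sum'
)
print(by_city_product)
```

The index argument sets the rows of the summary, columns optionally spreads a second grouping across the top, and aggfunc chooses how values are combined.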
Conclusion:
pandas provides a comprehensive set of tools to handle and analyze structured data. Whether you're performing basic data manipulation, cleaning, aggregation, or advanced data analysis, pandas simplifies the task, allowing you to focus on the logic of your analysis rather than the implementation details.
# Snippet 2: Reading Data from a CSV File
import pandas as pd
df = pd.read_csv('data.csv') # Replace 'data.csv' with your file path
print(df.head()) # Display the first 5 rows of the DataFrame
# Snippet 3: DataFrame Selection and Indexing
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
# Select a single column
print(df['Name'])
# Select multiple columns
print(df[['Name', 'Age']])
# Snippet 4: Filtering Data
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22], 'City': ['New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)
# Filter rows where Age is greater than 23
filtered_df = df[df['Age'] > 23]
print(filtered_df)
# Snippet 5: Handling Missing Data
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, np.nan, 22]}
df = pd.DataFrame(data)
# Fill missing Age values with the mean of the non-missing ages
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
# Snippet 6: Grouping Data
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [24, 27, 22, 32], 'City': ['NY', 'LA', 'NY', 'LA']}
df = pd.DataFrame(data)
# Group by 'City' and calculate the mean age
grouped = df.groupby('City')['Age'].mean()
print(grouped)
# Snippet 7: Sorting Data
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
# Sort by Age in ascending order
sorted_df = df.sort_values('Age', ascending=True)
print(sorted_df)
# Snippet 8: Applying Functions to Columns
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
# Create a function to convert age to age group
def categorize_age(age):
    if age < 25:
        return 'Young'
    elif 25 <= age < 30:
        return 'Mid-age'
    else:
        return 'Older'
# Apply the function to the 'Age' column
df['Age Group'] = df['Age'].apply(categorize_age)
print(df)