Exploratory Data Analysis (EDA) with Python: An In-Depth Guide Using Essential Functions

In data analysis, understanding your dataset’s structure and distribution is crucial before making any interpretations or applying models. Exploratory Data Analysis (EDA) provides this understanding through systematic exploration. Here, we’ll focus on using Python functions to gain insights without relying heavily on graphical methods, though we’ll also touch on some visualization techniques.

Step 1: Loading and Inspecting the Dataset

We’ll start with the Titanic dataset, a popular dataset in data analysis, and set up the environment by importing necessary libraries.

 
import pandas as pd
import numpy as np
import seaborn as sns

# Load the data
df = pd.read_csv('titanic.csv')

# Preview the data
df.head()

This code will load and display the first few rows of the Titanic dataset, giving you a quick overview of its structure.

Step 2: Basic Dataset Information

It’s important to familiarize yourself with the dataset’s structure. The info() and describe() functions provide a high-level summary of the data.

 
# Basic information about the dataset
df.info()

# Descriptive statistics
df.describe()

The info() function reveals data types and missing values, while describe() provides basic statistics for numerical columns.
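By default, describe() summarizes only numeric columns; passing include='object' extends the summary to text columns, reporting the count, number of unique values, most frequent value, and its frequency. A minimal sketch using a small illustrative frame standing in for the Titanic data:

```python
import pandas as pd

# Small illustrative frame (made-up values, Titanic-style columns)
df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Numeric summary: the default behaviour of describe()
numeric_summary = df.describe()

# Summary of object (text) columns: count, unique, top, freq
object_summary = df.describe(include='object')
print(object_summary)
```

Running describe() twice like this gives you one summary per kind of column without mixing the two layouts.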

Step 3: Identifying Duplicate Entries

Duplicate data can bias results, so it’s good to identify any duplicate rows early on.

 
# Count duplicate rows
df.duplicated().sum()

A result of 0 indicates there are no duplicate rows; any positive count tells you how many rows are exact repeats of an earlier one.
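If the count is non-zero, drop_duplicates() removes the repeated rows, keeping the first occurrence of each. A quick sketch on a tiny illustrative frame with one deliberate duplicate:

```python
import pandas as pd

# Illustrative frame with one duplicated row
df = pd.DataFrame({
    'PassengerId': [1, 2, 2],
    'Fare': [7.25, 71.28, 71.28],
})

n_dupes = df.duplicated().sum()   # number of rows flagged as duplicates
deduped = df.drop_duplicates()    # keep only the first occurrence of each row
```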

Step 4: Exploring Unique Values

Understanding the range of values within categorical columns is helpful, especially for feature analysis.

 
# Unique values in specific columns
print(df['Pclass'].unique())
print(df['Survived'].unique())
print(df['Sex'].unique())

This returns the distinct values within each specified column.
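As a numeric complement to unique(), nunique() counts the distinct values and value_counts() reports how often each one occurs. A short sketch on an illustrative column standing in for df['Pclass']:

```python
import pandas as pd

# Illustrative passenger-class column (made-up values)
pclass = pd.Series([3, 1, 3, 2, 3, 1])

print(pclass.unique())        # the distinct values
print(pclass.nunique())       # how many distinct values there are
print(pclass.value_counts())  # frequency of each distinct value
```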

Step 5: Visualizing Counts of Unique Values

Visualizations like count plots make it easier to see the frequency of categories within a column.

 
# Count plot for unique values in 'Pclass'
sns.countplot(x='Pclass', data=df)

This plot reveals the distribution of values in the Pclass column.

Step 6: Detecting Missing Values

Missing values can impact analysis quality. The isnull().sum() function helps identify columns with null entries.

 
# Check for null values
df.isnull().sum()

In the Titanic dataset, this reveals missing values in 'Age', 'Cabin', and 'Embarked', which you'll need to address for thorough analysis.

Step 7: Handling Missing Data

One way to address missing values is by replacing them with a specific value, such as 0.

 
# Replace missing values with 0
df = df.fillna(0)

# Verify changes
df.isnull().sum()

This fills every null value with 0, but 0 is rarely meaningful for categorical columns such as 'Cabin'; imputing with a column's mean, median, or mode is often preferable depending on the context.
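As a sketch of those alternatives, here is mean imputation for a numeric column and mode (most frequent value) imputation for a categorical one. The column names mirror the Titanic data but the values are made up:

```python
import pandas as pd

# Illustrative frame with missing numeric and categorical values
df = pd.DataFrame({
    'Age': [22.0, None, 26.0, None],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill missing entries with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Categorical column: fill missing entries with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```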

Step 8: Checking Data Types

Understanding data types is crucial, as it guides you in selecting appropriate analysis techniques for each attribute.

 
# Check data types of each column
df.dtypes

This function reveals each column’s data type, helping distinguish numerical from categorical data.
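When a column has the wrong type, for instance numbers read in as strings, pd.to_numeric() or astype() converts it before analysis. A minimal sketch with illustrative data:

```python
import pandas as pd

# 'Fare' accidentally read in as strings (an illustrative type problem)
df = pd.DataFrame({'Fare': ['7.25', '71.28', '8.05']})

# Convert the strings to floats so numeric operations work
df['Fare'] = pd.to_numeric(df['Fare'])
```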

Step 9: Filtering the Dataset

Filtering allows you to analyze subsets of data based on specific criteria.

 
# Filter for first-class passengers
df[df['Pclass'] == 1].head()

This code returns rows where passengers are in the first class.
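Conditions can also be combined with & (and) and | (or), with each condition wrapped in parentheses. A sketch on a small illustrative frame:

```python
import pandas as pd

# Illustrative subset of Titanic-style columns (made-up values)
df = pd.DataFrame({
    'Pclass': [1, 3, 1, 2],
    'Sex': ['female', 'male', 'male', 'female'],
    'Fare': [71.28, 7.25, 53.10, 13.00],
})

# Combine conditions with &; each condition needs its own parentheses
first_class_women = df[(df['Pclass'] == 1) & (df['Sex'] == 'female')]
```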

Step 10: Box Plot for Quick Visualization

Box plots are an effective way to examine the spread and detect outliers in numerical data.

 
# Box plot for the 'Fare' column
df[['Fare']].boxplot()

This gives a quick view of fare distribution, including any potential outliers.
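To complement the visual check, the 1.5 × IQR rule flags outliers numerically: anything below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A sketch on a handful of illustrative fares:

```python
import pandas as pd

# Illustrative fares with one extreme value
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = fares[(fares < lower) | (fares > upper)]
```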

Step 11: Correlation Matrix

The correlation matrix quantifies relationships between numerical features. You can visualize it for a more intuitive understanding.

 
# Correlation matrix (numeric columns only; corr() raises an error
# on text columns in recent pandas versions)
df.corr(numeric_only=True)

# Visualize the correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")

Values near 1 indicate a strong positive relationship, values near -1 a strong inverse relationship, and values near 0 little linear association.
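To see which features relate most strongly to a target such as 'Survived', you can pull out one column of the matrix and rank it by absolute value. A sketch on a small illustrative numeric frame:

```python
import pandas as pd

# Illustrative numeric frame (made-up values, Titanic-style columns)
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Fare':     [7.25, 71.28, 53.10, 8.05, 26.0],
    'Age':      [22, 38, 26, 35, 27],
})

# Correlation of every numeric column with 'Survived', strongest first
corr_with_target = (
    df.corr(numeric_only=True)['Survived']
      .drop('Survived')                        # skip the trivial self-correlation
      .sort_values(key=abs, ascending=False)   # rank by magnitude
)
print(corr_with_target)
```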

Conclusion

Exploratory Data Analysis is a fundamental part of any data project. With these Python functions, you can achieve a comprehensive understanding of your dataset, helping you make informed decisions before advancing to more complex analyses. Integrating both graphical and non-graphical approaches offers a fuller perspective on your data.

Happy Analyzing!
