
How to Merge and Filter Data Using Python Pandas

Efficient data processing and transformation is essential in data analysis. To produce useful insights, we frequently need to combine several datasets and extract the relevant records. The Python pandas package offers strong support for both tasks through merging and filtering. These operations let data analysts combine data from many sources and select the rows that meet given conditions. This blog post covers how to merge and filter data in pandas, when to use each, and different methods for doing so efficiently.



What is Merging?


Combining two or more datasets based on shared columns or indices is known as merging. It is comparable to a SQL join, which uses a shared key to combine data from several tables.


An e-commerce dataset, for instance, might have separate tables for customer information and order history. We can track purchases, examine consumer behavior, and make data-driven decisions by combining these datasets.


Types of Merging in Pandas


1.      Inner Join

An inner join returns only the rows with matching values in both datasets. A row is dropped if it has no matching entry in the other dataset. Use it when you need only what two tables have in common, such as customers who have bought something at least once.


2.      Left Join

A left join returns all rows from the left dataset and only the matching rows from the right dataset. Where no match is found, the columns from the right dataset are filled with missing values (NaN). Use it when you want every customer's information, even for customers who haven't made any transactions.


3.      Right Join

A right join returns all rows from the right dataset together with only the matching rows from the left dataset. For unmatched rows, the columns from the left dataset are filled with NaN. When examining sales statistics, for example, you might want to analyse all orders even if some customers' information is absent from the customer dataset.


4.      Outer Join

An outer join returns every row from both datasets. Missing values (NaN) are assigned wherever a row has no match in the other dataset. This guarantees that no rows are lost. Use it when you want a complete view of both datasets, even if some entries don't match.

Merging is essential when working with multiple data sources that contain complementary information. It allows analysts to create a unified dataset that can be used for deeper analysis.
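To see which join type a given row would survive, pandas can label each row of an outer join with its origin. The sketch below uses the indicator parameter of pd.merge; the two small tables are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample tables, used only for illustration
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
    'order_id': [101, 102],
    'customer_id': [1, 4],
    'amount': [250, 150]
})

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only' or 'right_only'
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(audit[['customer_id', '_merge']])

# 'both' rows are what an inner join keeps; a left join keeps 'both' and
# 'left_only'; a right join keeps 'both' and 'right_only'
print(audit['_merge'].value_counts())
```

Inspecting the _merge column before choosing a join type is a cheap way to find out how many records have no counterpart in the other table.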


How to Merge Data with pandas

The main function is:

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, ...)

  • left, right: DataFrames to merge.

  • how: type of merge — 'inner', 'left', 'right', 'outer'. Default is 'inner'.

  • on: Column or index level names to join on. Must be found in both DataFrames.

  • left_on, right_on: Columns to join on from the left and right DataFrames when the key columns have different names.
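When the key column is named differently in each table, left_on and right_on take the place of on. Here is a minimal sketch, assuming hypothetical column names id and cust_id that do not appear in the examples elsewhere in this post.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
customers = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'cust_id': [1, 2, 2],
    'amount': [250, 300, 450]
})

# Match customers.id against orders.cust_id
merged = pd.merge(customers, orders, left_on='id', right_on='cust_id', how='inner')

# Both key columns are kept in the result, so drop the redundant one
merged = merged.drop(columns='cust_id')
print(merged)
```

Dropping the duplicate key is optional, but it keeps the merged table easier to read.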


Example Code:

import pandas as pd

# Sample customer data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Sample orders data
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 5],
    'amount': [250, 300, 450, 150]
})

# Inner Join: Only customers with orders
inner_merged = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner Join:\n", inner_merged)

# Left Join: All customers, orders if exist
left_merged = pd.merge(customers, orders, on='customer_id', how='left')
print("\nLeft Join:\n", left_merged)

# Right Join: All orders, customer info if exists
right_merged = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight Join:\n", right_merged)

# Outer Join: All customers and all orders
outer_merged = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter Join:\n", outer_merged)
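A quick way to sanity-check a merge is to compare row counts across the join types. The sketch below rebuilds the same sample tables so it runs on its own, then counts the rows each join produces and how many of them contain missing values. The row_counts dictionary is just an illustrative helper.

```python
import pandas as pd

# Same sample data as in the example above
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 5],
    'amount': [250, 300, 450, 150]
})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(customers, orders, on='customer_id', how=how)
    row_counts[how] = len(merged)
    # Rows with at least one NaN are the ones that had no match
    unmatched = int(merged.isna().any(axis=1).sum())
    print(f"{how}: {len(merged)} rows, {unmatched} with missing values")

# inner -> 3 rows, left -> 5 rows, right -> 4 rows, outer -> 6 rows
```

Bob appears twice in every result because he has two orders: a merge repeats the customer row once per matching order. Customer 5 has an order but no customer record, so that row only shows up in the right and outer joins.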

What is Filtering?


The process of selecting the rows of a dataset that meet given criteria, and discarding the rest, is called filtering. It is especially helpful with huge datasets, where an analysis often needs only a portion of the data.

For instance, you might want to limit a dataset of millions of transactions to those from a certain location or those that total more than $5000.


Common Filtering Techniques


1. Filtering on a Single Condition

Selecting rows where a column satisfies a particular criterion is the simplest way to filter data. For instance, you can extract every sale with a value over $5000.

Use case: Finding high-value transactions in a financial dataset.

# Select rows where sale amount > 5000

filtered = orders[orders['amount'] > 5000]
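Under the hood, the comparison produces a boolean Series (a mask), and indexing with that mask keeps only the True rows. The sketch below shows this on the sample orders table. The threshold of 280 is an arbitrary value picked so that the small sample returns something, since none of the sample amounts exceed 5000.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 5],
    'amount': [250, 300, 450, 150]
})

# Step 1: the comparison yields one True/False value per row
mask = orders['amount'] > 280
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
high_value = orders[mask]
print(high_value)

# The same filter written with DataFrame.query
high_value_q = orders.query('amount > 280')
```

Both forms return the same rows. query can be easier to read when a condition gets long, while the mask form is easier to reuse and combine with other masks.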


2. Using Multiple Conditions for Filtering

To narrow the dataset further, you may need to apply more than one condition. Conditions can be combined with the logical operators & (AND) and | (OR). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as > and ==.

Use case: Extracting records for New York-based customers who have spent more than $1000.

# Customers with State as New York & amount > 1000 (assumes orders has a 'State' column)

filtered = orders[(orders['amount'] > 1000) & (orders['State'] == 'New York')]
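Naming each condition first can make combined filters easier to read and reuse. This is a minimal sketch, assuming a hypothetical orders table that includes a State column (the sample orders table earlier in this post does not have one).

```python
import pandas as pd

# Hypothetical orders table with a 'State' column, for illustration only
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'amount': [1500, 800, 2200, 400],
    'State': ['New York', 'New York', 'Texas', 'Texas']
})

# Name each condition so the combinations read clearly
big = orders['amount'] > 1000
ny = orders['State'] == 'New York'

both = orders[big & ny]      # AND: large orders from New York
either = orders[big | ny]    # OR: large orders, or any order from New York
not_ny = orders[~ny]         # NOT: everything outside New York

print(both)
print(either)
print(not_ny)
```

Note that Python's and, or and not keywords do not work element-wise on a Series; pandas raises an error about ambiguous truth values, so use &, | and ~ instead.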


3. Filtering Using String Matching

Text-based filtering is helpful for categorical data, such as selecting all customers from a certain city or all items in a given category.

Use case: Extracting from a retail dataset all sales information pertaining to "Electronics" products.

filtered = sales[sales['Product'] == "Electronics"]
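An exact == comparison misses values that differ in case or carry stray whitespace, which is common in real data. The sketch below shows a few alternatives on a hypothetical sales table, deliberately seeded with messy values.

```python
import pandas as pd

# Hypothetical sales table with deliberately messy category labels
sales = pd.DataFrame({
    'Product': ['Electronics', 'electronics ', 'Home Electronics', 'Clothing', None],
    'amount': [500, 120, 300, 80, 60]
})

# Exact match: only the first row qualifies
exact = sales[sales['Product'] == 'Electronics']

# Normalize whitespace and case before comparing
normalized = sales['Product'].str.strip().str.lower()
clean_exact = sales[normalized == 'electronics']

# Substring match, ignoring case; na=False treats missing values as no match
contains = sales[sales['Product'].str.contains('electronics', case=False, na=False)]

# Match any value from a list of categories
several = sales[sales['Product'].isin(['Electronics', 'Clothing'])]

print(len(exact), len(clean_exact), len(contains), len(several))
```

Which variant is right depends on the data: substring matching also picks up "Home Electronics" here, which may or may not be what you want.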


4. Applying Date Range Filtering

Filtering based on date ranges is essential for time-series analysis, as many datasets contain date columns. For example, you might wish to extract data for just the previous six months.

Use case: Filtering transactions from January 2023 to June 2023 in order to analyze sales patterns.

# Filter dates between Jan and June 2023 (assumes orders['date'] is a datetime column)

filtered_dates = orders[(orders['date'] >= '2023-01-01') & (orders['date'] <= '2023-06-30')]

Filtering reduces noise and makes analysis more efficient by helping analysts concentrate on the most relevant records.
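Date filters are more reliable when the column holds real datetime values instead of strings. Here is a minimal sketch, assuming a hypothetical orders table with a date column stored as text, which is converted with pd.to_datetime before filtering.

```python
import pandas as pd

# Hypothetical orders table; dates arrive as strings, as they often do from CSV files
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'date': ['2022-12-15', '2023-01-01', '2023-06-30', '2023-07-04'],
    'amount': [250, 300, 450, 150]
})

# Convert the text column to datetime so comparisons are chronological
orders['date'] = pd.to_datetime(orders['date'])

start = pd.Timestamp('2023-01-01')
end = pd.Timestamp('2023-06-30')

# Series.between includes both endpoints by default
in_range = orders[orders['date'].between(start, end)]
print(in_range)
```

If the column also carries a time of day, an upper bound of 2023-06-30 means midnight at the start of that day, so later transactions on June 30 would be excluded. In that case, comparing with < '2023-07-01' is the safer choice.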


When to Use Merging and Filtering?


Merging and filtering are fundamental techniques for handling structured data. They are widely used in:


  • Data Cleaning and Preprocessing: Combining data from several sources and removing unnecessary records to create a clean dataset.




Data professionals can effectively organize, analyze, and extract insights from complex datasets by becoming proficient with pandas' merging and filtering features.
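In practice the two techniques are usually chained: merge first to bring the columns together, then filter the combined table. The sketch below does this with the sample customers and orders tables. The 300 cutoff is an arbitrary value chosen for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 5],
    'amount': [250, 300, 450, 150]
})

# Step 1: a left join keeps every customer, with NaN where no order exists
merged = pd.merge(customers, orders, on='customer_id', how='left')

# Step 2a: customers who have never ordered show up with a missing amount
no_orders = merged[merged['amount'].isna()]
print(no_orders[['customer_id', 'name']])

# Step 2b: orders of at least 300, now with the customer's name attached
large = merged[merged['amount'] >= 300]
print(large[['name', 'order_id', 'amount']])
```

One thing to watch for: after the left join, order_id and amount become floating-point columns, because NaN cannot be stored in a regular integer column.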


Conclusion

Merging and filtering are two crucial data manipulation methods that let analysts combine several datasets and narrow them down to the records that matter. They offer strong tools for organizing and refining your data for analysis, whether you are working with machine learning datasets, sales records, financial transactions, or consumer data.


If you want to improve your data manipulation abilities and work on real-world projects, consider enrolling in Iota's Data Analytics Course, where you will study pandas procedures in depth and apply them to real-world case studies.

 

 

 

 
