The Best Python Pandas Tutorial

Pandas is one of the most widely used Python libraries in data science and analytics. It offers high-performance, user-friendly data structures and tools for data analysis. In Pandas, two-dimensional table objects are called DataFrames, while one-dimensional labeled arrays are known as Series. A DataFrame carries both column names and row labels.

What Is Python Pandas?

Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work on structured data seamlessly and efficiently. Developed by Wes McKinney in 2008, Pandas is built on top of the NumPy library and is widely used for data wrangling, cleaning, analysis, and visualization.

What Is Pandas Used For?

Pandas is extensively used for:

  • Data Cleaning: Handling missing values, duplications, and incorrect data formats.
  • Data Manipulation: Filtering, transforming, and merging datasets.
  • Data Analysis: Performing statistical analysis and aggregations.
  • Data Visualization: Creating plots and charts to visualize data trends and patterns.
  • Time Series Analysis: Handling and manipulating time series data.
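
As a rough illustration of the time series item above, the sketch below builds a small daily Series and applies a moving window statistic and a frequency conversion. The dates and readings are made-up values, and the variable names are chosen only for this example.

```python
import pandas as pd

# Six consecutive daily readings (made-up values for illustration)
dates = pd.date_range(start="2024-01-01", periods=6, freq="D")
readings = pd.Series([10, 12, 11, 15, 14, 18], index=dates)

# Moving window statistic: mean over the current and two previous days
rolling_mean = readings.rolling(window=3).mean()

# Frequency conversion: downsample to 2-day bins and sum each bin
two_day_totals = readings.resample("2D").sum()

print(rolling_mean)
print(two_day_totals)
```

The first two rolling values are missing because a full three-day window is not yet available at those positions.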

Key Benefits of the Pandas Package

  1. Ease of Use: Pandas offers an intuitive syntax and rich functionality, making data manipulation and analysis straightforward, even for those new to programming.
  2. Efficiency: Built on top of NumPy, Pandas is optimized for performance with large datasets, providing fast and efficient data manipulation capabilities.
  3. Versatility: Pandas supports a wide range of data formats, including CSV, Excel, SQL databases, and more, allowing seamless integration with various data sources.
  4. Robust Data Structures: The library provides powerful data structures, such as Series and DataFrame, which are essential for handling structured data flexibly and efficiently.
  5. Comprehensive Functionality: Pandas includes numerous methods for data cleaning, transformation, and analysis, such as handling missing values, merging datasets, and grouping data.
  6. Time Series Support: Pandas has robust support for time series data, including easy date range generation, frequency conversion, moving window statistics, and more.
  7. Data Alignment: Automatic data alignment and handling of missing data simplify the process of working with incomplete datasets.
  8. Integration with Other Libraries: Pandas seamlessly integrates with other popular Python libraries, such as Matplotlib for data visualization and Scikit-Learn for machine learning.
  9. Active Community and Documentation: Pandas has a large and active community, extensive documentation, and numerous tutorials and resources, making it easier for users to find help and learn best practices.
  10. Open Source: As an open-source library, Pandas is free to use and continuously improved by contributions from the global data science community.
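
The data alignment benefit listed above can be sketched with two Series whose labels only partly overlap. The region names and numbers are invented for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative values)
q1 = pd.Series([100, 200, 300], index=["north", "south", "east"])
q2 = pd.Series([10, 20, 30], index=["south", "east", "west"])

# Addition matches values by label, not by position.
# Labels present in only one Series produce NaN.
total = q1 + q2

# Series.add with fill_value treats a missing label as 0 instead
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Here "north" and "west" come out as NaN in the plain sum, while the fill_value version keeps their single available value.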

How to Install Pandas?

Installing Pandas is a straightforward process that can be done using Python's package manager, pip. Follow these steps to install Pandas on your system:

Step 1: Verify Python Installation

Ensure that Python is installed on your system. You can check this by running the following command in your command prompt or terminal:

python --version

Step 2: Open Command Prompt or Terminal

Open your command prompt (Windows) or terminal (MacOS/Linux).

Step 3: Install Pandas Using pip

Run the following command to install Pandas:

pip install pandas

This command will download and install the latest version of Pandas along with its dependencies.

Step 4: Verify the Installation

After the installation is complete, you can verify that Pandas is installed correctly by opening a Python shell and importing Pandas:

import pandas as pd
print(pd.__version__)

If Pandas is installed correctly, this will print the version of Pandas you have installed.

Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding any data type. It is similar to a column in a spreadsheet or a SQL table.

import pandas as pd

# Creating a Series
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
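
The Series above gets the default integer labels 0 to 4. Since a Series is a labeled array, it can also carry labels of your own. The following sketch assumes some made-up temperature readings keyed by weekday.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
temperatures = pd.Series([21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])

# Label-based access
print(temperatures["Tue"])

# Position-based access with iloc
print(temperatures.iloc[0])

# The labels themselves
print(temperatures.index.tolist())
```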

Basic Operations on Series

You can perform various operations on Series, such as arithmetic operations, filtering, and statistical calculations.

# Arithmetic Operations
series2 = series + 10
print(series2)

# Filtering
filtered_series = series[series > 2]
print(filtered_series)

# Statistical Calculations
mean_value = series.mean()
print(mean_value)

Pandas DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
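
To make the "heterogeneous" and "labeled axes" parts of the definition concrete, the sketch below rebuilds the same DataFrame and inspects its shape, per-column types, and axis labels.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# (number of rows, number of columns)
print(df.shape)

# Each column has its own data type: text for Name and City, integers for Age
print(df.dtypes)

# Row labels (a default integer range here) and column labels
print(df.index.tolist())
print(df.columns.tolist())
```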

Basic Operations on DataFrames

DataFrames support a wide range of operations for data manipulation and analysis.

# Accessing Columns
print(df['Name'])

# Adding a New Column
df['Salary'] = [70000, 80000, 90000]
print(df)

# Dropping a Column
df = df.drop('City', axis=1)
print(df)
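
The operations above work on columns. Selecting rows is just as common, so here is a short sketch of three usual approaches: a boolean filter, position-based access with iloc, and label-based access with loc. The DataFrame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 80000, 90000]
})

# Boolean filter: keep rows where Age is greater than 28
over_28 = df[df['Age'] > 28]
print(over_28)

# Position-based: the first row, returned as a Series
first_row = df.iloc[0]
print(first_row)

# Label-based: the Salary value in the row labeled 1
bob_salary = df.loc[1, 'Salary']
print(bob_salary)
```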

Python Pandas Sorting

Sorting data is a fundamental aspect of data analysis. In Pandas, you can sort your data based on the values in one or more columns or by the DataFrame's index. This capability allows you to organize and analyze your data more effectively.

Sorting by Values:

To sort a DataFrame by the values of a specific column, you use the sort_values method.

import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [70000, 80000, 90000]}
df = pd.DataFrame(data)

# Sorting by 'Age'
sorted_df = df.sort_values(by='Age')
print(sorted_df)

Sorting by Index:

You can also sort your DataFrame by its index using the sort_index method.

# Sorting by Index
sorted_df_index = df.sort_index()
print(sorted_df_index)

Both methods allow for ascending or descending order sorting by setting the ascending parameter to True or False.
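
As a sketch of the ascending parameter and of sorting on more than one column, the example below uses a slightly larger made-up DataFrame with repeated ages so that the second sort key has an effect.

```python
import pandas as pd

# Illustrative data with tied ages
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [30, 25, 30, 25],
    'Salary': [70000, 80000, 90000, 60000]
})

# Descending order on a single column
by_age_desc = df.sort_values(by='Age', ascending=False)
print(by_age_desc)

# Two sort keys: Age ascending, then Salary descending within each age
by_age_then_salary = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(by_age_then_salary)

# Descending order on the index
index_desc = df.sort_index(ascending=False)
print(index_desc)
```

Both methods return a new DataFrame and leave df unchanged unless you reassign the result.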

Python Pandas Groupby

The groupby method in Pandas is a powerful tool that allows you to group data based on one or more columns and perform aggregate operations on those groups. This is particularly useful for summarizing data and gaining insights into different subsets of your data.

Grouping and Aggregating:

Here's how you can use groupby to group data and perform aggregation operations like sum, mean, or count.

# Sample DataFrame
data = {'Department': ['HR', 'Finance', 'HR', 'Finance', 'HR'],
        'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
        'Salary': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)

# Grouping by 'Department' and summing the 'Salary'
grouped = df.groupby('Department')['Salary'].sum()
print(grouped)

The groupby method returns a GroupBy object, which can then be aggregated using various functions like sum, mean, count, etc.
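
Several aggregations can also be computed in one pass with agg. The sketch below reuses the sample data from this section and asks for the sum, mean, and count of Salary per department.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Finance', 'HR', 'Finance', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

# One row per department, one column per aggregation
summary = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])
print(summary)
```

The result is a DataFrame indexed by department, so a single figure can be read with, for example, summary.loc['HR', 'mean'].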

Python Pandas: Merging

Merging is a crucial operation that allows you to combine two DataFrames based on a common column or index. Pandas provides the merge function for this purpose, which is similar to SQL joins.

Merging DataFrames:

# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [2, 3, 4]})

# Merging on 'key' column
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

You can specify the type of join (inner, outer, left, right) using the how parameter.

# Outer Join
outer_merged_df = pd.merge(df1, df2, on='key', how='outer')
print(outer_merged_df)
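
For comparison, the remaining join types can be sketched with the same two DataFrames. A left join keeps every key from df1, a right join keeps every key from df2, and values with no match are filled with NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [2, 3, 4]})

# Left join: keys A, B, C; value2 is NaN for A
left_merged = pd.merge(df1, df2, on='key', how='left')
print(left_merged)

# Right join: keys B, C, D; value1 is NaN for D
right_merged = pd.merge(df1, df2, on='key', how='right')
print(right_merged)
```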

Python Pandas: Concatenation

Concatenation is the process of appending DataFrames along a particular axis (rows or columns). Pandas' concat function allows you to concatenate two or more DataFrames.

Concatenating DataFrames:

# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# Concatenating along rows
concat_df = pd.concat([df1, df2])
print(concat_df)

You can also concatenate along columns by setting the axis parameter to 1.

# Concatenating along columns
concat_df_col = pd.concat([df1, df2], axis=1)
print(concat_df_col)
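
One detail worth knowing when concatenating along rows: each input keeps its own row labels, so the result can contain repeated index values. A minimal sketch of the difference, using the ignore_index parameter of concat to renumber the rows:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# Default behavior: original row labels are kept, so 0, 1, 2 appear twice
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers the rows 0..5
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())
```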

Data Visualization With Pandas

Data visualization is crucial to data analysis, allowing you to see patterns, trends, and outliers in your data. Pandas integrates well with Matplotlib, making it easy to create various plots directly from your DataFrame.

Plotting Data:

import matplotlib.pyplot as plt
import pandas as pd

# Sample DataFrame
data = {'Year': [2017, 2018, 2019, 2020, 2021],
        'Sales': [250, 300, 400, 350, 500]}
df = pd.DataFrame(data)

# Plotting a line graph
df.plot(x='Year', y='Sales', kind='line')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Yearly Sales')
plt.show()

Pandas supports various plot types, including line plots, bar plots, histograms, and more. You can effectively communicate your data insights and findings by leveraging these visualization capabilities.
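
Switching plot types is mostly a matter of changing the kind argument. The sketch below draws the same sales data as a bar chart. It assumes Matplotlib is installed, selects the non-interactive Agg backend so the script also runs without a display, and writes to a hypothetical file name chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show() instead
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Year': [2017, 2018, 2019, 2020, 2021],
                   'Sales': [250, 300, 400, 350, 500]})

# DataFrame.plot returns a Matplotlib Axes object that can be customized further
ax = df.plot(x='Year', y='Sales', kind='bar', legend=False)
ax.set_xlabel('Year')
ax.set_ylabel('Sales')
ax.set_title('Yearly Sales')

plt.tight_layout()
plt.savefig('yearly_sales_bar.png')  # hypothetical output file name
```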

Conclusion

Pandas is an essential tool for data scientists and analysts. Its powerful data structures and comprehensive functionality make it the go-to library for data manipulation, analysis, and visualization in Python. By mastering Pandas, you can handle and analyze data more efficiently, leading to more insightful and actionable results.

FAQs

1. What are the main data structures in Pandas?

The main data structures in Pandas are Series and DataFrame. A Series is a one-dimensional labeled array capable of holding any data type. A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). These structures provide the foundation for data manipulation and analysis in Pandas.

2. How do I select a column in a DataFrame?

To select a column in a DataFrame, you can use either the bracket notation or the dot notation. For example, if you have a DataFrame df and want to select the column named "Age":

age_column = df['Age']  # Bracket notation
age_column = df.Age     # Dot notation

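Both methods return a Series containing the data from the specified column. Dot notation works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket notation is the safer default.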
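
A short sketch of where the two notations differ. The column names here are invented to trigger the edge cases: one contains a space, and one shares its name with the DataFrame.count method.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30],
    'First Name': ['Alice', 'Bob'],  # contains a space
    'count': [1, 2]                  # same name as the DataFrame.count method
})

# Bracket notation works for any column name
print(df['First Name'])
print(df['count'])

# Dot notation resolves to the method, not the column
print(callable(df.count))

# A list of names inside the brackets selects several columns as a DataFrame
subset = df[['Age', 'count']]
print(subset)
```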

3. How do I handle missing values in a DataFrame?

Pandas provides several methods to handle missing values. You can use dropna() to remove rows or columns with missing values, or fillna() to replace them with a specified value. For example:

df_cleaned = df.dropna()         # Removes rows with any missing values
df_filled = df.fillna(0)         # Replaces all missing values with 0
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Replaces missing values in 'Age' with the column's mean
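
To see these calls act on actual gaps, the sketch below builds a small DataFrame with missing entries (made-up data), counts them with isna, and then applies dropna and fillna. The fill is written as an assignment back to the column, which avoids relying on in-place modification of a column selection.

```python
import pandas as pd

# None marks a missing entry; in the numeric column it is stored as NaN
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],
    'City': ['New York', None, 'Chicago']
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
df_cleaned = df.dropna()
print(df_cleaned)

# Fill missing ages with the mean of the known ages
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
```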

4. How do I group data in a DataFrame?

To group data in a DataFrame, use the groupby() method. This method groups the data based on one or more columns and allows you to apply aggregate functions to each group. For example:

grouped = df.groupby('Department')
sum_salary = grouped['Salary'].sum()  # Sum of 'Salary' for each department

The groupby() method returns a GroupBy object, which can then be aggregated using functions like sum(), mean(), count(), etc.

About the Author

Ravikiran A S

Ravikiran A S works with Simplilearn as a Research Analyst. He is an enthusiastic geek, always on the hunt to learn the latest technologies. He is proficient with the Java programming language, Big Data, and powerful Big Data frameworks like Apache Hadoop and Apache Spark.
