1.2 - Data Collection and Preparation

🔖

"Data Collection and Preparation" - Gather data from various sources, clean it, and remove duplicates to make it ready for analysis.

We collected the data from Kaggle and downloaded it to our local machine. Import the necessary libraries, pandas and NumPy, and read the CSV file from the local machine using pandas.

✍️

Pandas - a library used for data manipulation and analysis.
NumPy - supports large, multi-dimensional arrays and matrices, and provides high-level mathematical functions to operate on them.

import pandas as pd
import numpy as np

# Load the dataset using 'read_csv'
employee_data = pd.read_csv('.../kaggle/employee-attritiion.csv')

# To see first 5 rows of data
employee_data.head()

# To know the size of our data (shape is an attribute, so no parentheses)
employee_data.shape # (74498, 24)

Data Preparation

In this step, we check whether the data has any missing values, duplicate values, or errors, and remove them.

# find missing values - see if data has any NaN values
employee_data.isnull().sum()

# output - 
# Employee ID  0
# Age          0
# Gender       0 etc...

# find duplicate values - returns True for each row that repeats an earlier row
employee_data.duplicated()
# or
employee_data[employee_data.duplicated()]
# returns only the duplicated rows

In our case, there are no duplicates or missing values.
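
Since our dataset needed no cleanup, the removal step itself is not shown above. As a minimal sketch of what it might look like when a dataset does contain problems, the example below uses a small hypothetical DataFrame (`toy`, not the Kaggle data) with one repeated row and one missing value. Dropping rows is only one possible strategy and is assumed here for illustration; filling missing values may be a better choice for other datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Kaggle dataset): one repeated row, one missing Age
toy = pd.DataFrame({
    "Employee ID": [1, 2, 2, 3],
    "Age": [34.0, 41.0, 41.0, np.nan],
    "Gender": ["F", "M", "M", "F"],
})

missing_per_column = toy.isnull().sum()   # number of NaN values in each column
duplicate_count = toy.duplicated().sum()  # rows identical to an earlier row

# Drop exact duplicate rows first, then rows containing any missing value
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column["Age"], duplicate_count, cleaned.shape)
# prints: 1 1 (2, 3)
```

Both `drop_duplicates()` and `dropna()` return a new DataFrame, so `toy` itself is left unchanged.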
