1.2 - Data Collection and Preparation
"Data Collection and Preparation" - Gather, clean, and remove duplicate data from various sources to make it ready for analysis.
We collected the data from Kaggle and downloaded it to our local machine. Next, we import the necessary libraries, pandas and NumPy, and read the CSV file from the local machine using pandas.
Pandas - a library for data manipulation and analysis.
NumPy - supports large, multi-dimensional arrays and matrices, and provides high-level mathematical functions to operate on them.
import pandas as pd
import numpy as np
# Load the dataset using 'read_csv'
employee_data = pd.read_csv('.../kaggle/employee-attritiion.csv')
# To see first 5 rows of data
employee_data.head()
# To know the size of our data (shape is an attribute, so no parentheses)
employee_data.shape # (74498, 24)
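To try these loading steps without the Kaggle file, here is a minimal sketch that reads a tiny in-memory CSV instead. The rows and the names `csv_text`, `sample_data`, and `first_two` are made up for illustration and are not part of the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV (made-up rows, not the real data)
csv_text = """Employee ID,Age,Gender
1,34,Male
2,29,Female
3,41,Female
"""

# read_csv accepts a file path or any file-like object
sample_data = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute that holds (rows, columns)
n_rows, n_cols = sample_data.shape
print(n_rows, n_cols)

# head() returns the first 5 rows by default; pass a number to get fewer
first_two = sample_data.head(2)
print(first_two)
```

The same calls work unchanged on the full dataset once `read_csv` is pointed at the downloaded file.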
Data Preparation
In this step, we check whether the data has any missing values, duplicate rows, or errors, and remove them if it does.
# find missing values - see if data has any NaN values
employee_data.isnull().sum()
# output -
# Employee ID 0
# Age 0
# Gender 0 etc...
# find duplicate values - see if any rows are repeated
# returns a boolean Series: True where a row repeats an earlier one
employee_data.duplicated()
# or
# returns all duplicated rows
employee_data[employee_data.duplicated()]
In our case, the data has no duplicates or missing values.
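Had the checks turned up problems, the removal step could look like the sketch below. It uses a small made-up DataFrame (`toy`) rather than the Kaggle data, and dropping rows is only one possible choice here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Kaggle dataset):
# one repeated row and one missing Age value
toy = pd.DataFrame({
    "Employee ID": [1, 2, 2, 3],
    "Age": [34.0, 29.0, 29.0, np.nan],
    "Gender": ["Male", "Female", "Female", "Female"],
})

# Count the problems before cleaning
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
n_missing = int(toy.isnull().sum().sum())    # NaN cells across all columns

# Remove repeated rows (keeping the first occurrence), then rows with any NaN
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(n_duplicates, n_missing, cleaned.shape)
```

Depending on the analysis, filling missing values with `fillna` may be preferable to dropping rows, since `dropna` discards the whole row even when only one column is empty.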