This is a cheatsheet of the data preprocessing steps typically needed before training a deep learning model.
Read in and slice
First, read in the dataset and select the relevant feature and target columns.
# Importing the dataset
import pandas as pd

dataset = pd.read_csv('dataset.csv')
X = dataset.iloc[:, 3:13].values  # feature matrix: columns 3 through 12
y = dataset.iloc[:, 13].values    # target vector: column 13
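A quick look at the raw frame helps confirm which columns to keep; a minimal sanity check with standard pandas:
dataset.head()    # first few rows
dataset.dtypes    # column types - spot categoricals early
dataset.shape     # (rows, columns)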
Handle Nulls
Check whether any nulls are present and decide how to handle them.
# Check columns for nulls (run on the DataFrame, before taking .values)
dataset.isnull().sum()
dataset = dataset.fillna(dataset.mean(numeric_only=True))  # one option: fill numeric nulls with column means
dataset = dataset.dropna()                 # or: drop rows that still contain nulls
dataset = dataset.drop('colname', axis=1)  # or: drop a whole column ('colname' is a placeholder)
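For a more systematic fill, scikit-learn's SimpleImputer learns the statistic on the data it is fitted on and can reuse it later; a minimal sketch (the median strategy is an assumption, pick whatever suits the data):
from sklearn.impute import SimpleImputer

# Impute missing numeric values with the column median (one possible strategy)
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)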
Encode categorical data
Label-encode the categorical columns; one-hot encode any categorical that takes more than two values.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label-encode column 1: categories become integer codes (not dummies yet)
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
# One-hot encode column 1. The old categorical_features argument was removed
# from OneHotEncoder; ColumnTransformer is now the way to target one column.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [1])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)  # dummy columns come first, remaining columns after
# Drop the first dummy to avoid the dummy variable trap (perfect multicollinearity)
X = X[:, 1:]
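If the data is still in a DataFrame, pandas' get_dummies is an equivalent shortcut whose drop_first flag handles the dummy trap directly ('colname' is again a placeholder):
# Alternative: one-hot encode in pandas; drop_first avoids the dummy trap
dataset = pd.get_dummies(dataset, columns=['colname'], drop_first=True)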
Split the dataset
Split off a held-out test set so the model is evaluated on data it never saw during training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
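If the target classes are imbalanced, passing the labels to stratify keeps the class proportions identical in both splits (an optional refinement, not part of the original recipe):
# Optional: stratified split preserves the class ratio of y in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)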
Feature Scaling
Fit the scaler on the training set only, then reuse it to transform the test set, so no test-set information leaks into training.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on train: learns each column's mean and std
X_test = sc.transform(X_test)        # apply the same train statistics to the test set
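The fit-on-train / transform-on-test discipline can also be enforced automatically with a scikit-learn Pipeline; a minimal sketch, with LogisticRegression standing in as a placeholder estimator:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline fits the scaler on training data only and reapplies it at predict time
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)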