I remember that when I was fresh out of college, I wanted to learn machine learning. At the time, I was a data analyst working primarily with SQL. I had a strong foundation in statistics, but my coding skills were limited to college courses in Python and C++ and the occasional use of R and MATLAB. While exploring ways to advance in my role, I felt discouraged and inundated with buzz phrases: “big data,” “PhDs needed,” “working proficiency in CS,” even a strange language called “Pig” (look above). To me, data science was a black box. But it doesn’t need to be! Everyone’s starting point is different, so find the resources that work for you. This walkthrough is aimed at those who have completed a course in Python and have taken the initial steps to read and talk about data science.
Using RMDS’s 4E workflow, I will walk you through the steps of my first personal project in modeling: running Random Forest on Kaggle’s famous Titanic dataset.
Here are the steps of RM4E:
- Importing libraries
- Loading data
- Summary statistics
- Data visualization
- Data processing, i.e., encoding, treating null values, standardizing
- Binary classification algorithms, such as Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Naive Bayes, and Support Vector Machine
- Random Forest: Gini Impurity
- K-fold cross-validation to estimate the accuracy of each model
- Confusion Matrix
- Accuracy, Recall, Precision, F1 Score
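
To make the last few items less abstract, here is a minimal sketch in plain Python of Gini impurity (the split criterion named in the Random Forest step) and of the metrics derived from a confusion matrix. The function names and the toy labels are my own assumptions for illustration. They are not Titanic data or results, and in practice a library such as scikit-learn would compute these for you.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(gini_impurity(y_true))          # evenly mixed node
print(gini_impurity([1, 1, 1, 1]))    # pure node
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

A pure node (all labels the same) has a Gini impurity of 0, while an evenly mixed binary node has 0.5, which is why a decision tree prefers splits that push the impurity toward 0. Precision and recall are kept separate from accuracy because, on an imbalanced dataset, a model can score high accuracy while still missing most of the positive class.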