Machine Learning Lectures: Steps 2-4
“Data is the new science. Big Data holds the answers.”
– Pat Gelsinger
Steps 2-4: Get The Right Data
- Step 2: Get data
- Step 3: Split data
- Step 4: Explore the data
S2: Data Engineering
S3: Overfitting Danger
S3: What Is Underfitting
- Getting Data
- You are responsible for telling the algorithm what data to use
- Engineers tell the algorithms what data to look at
- Split the data
- Avoid driving RMSE on the training data to its absolute minimum
- Generalization, not memorization!
- Overfitting: happens when you model noise instead of reality
- You need fresh data for checking performance.
- This is why you split your data!
- Underfitting: happens when you insist on using a model that is too simple
- Let your data speak
- With underfitting, the damage is easy to see
- Overfitting is harder to detect
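The overfitting and underfitting points above can be made concrete with a small sketch. Everything in it is an assumption made for illustration, not material from the lecture: the synthetic data (a straight line plus noise), the helper names `fit_mean`, `fit_line`, and `fit_memorizer`, and the choice of a nearest-neighbour lookup as the "memorizing" model. It compares RMSE on the data each model was fitted to with RMSE on fresh data drawn from the same process.

```python
import math
import random


def rmse(predict, points):
    """Root-mean-square error of `predict` over a list of (x, y) pairs."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in points) / len(points))


def fit_mean(train):
    """Too simple: ignores x and always predicts the average y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Ordinary least-squares straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def fit_memorizer(train):
    """Memorization: return the y of the nearest training x, noise included."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def sample(n, rng):
    """Made-up ground truth for this sketch: y = 2x + 1 plus Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 2 * x + 1 + rng.gauss(0, 1)))
    return points


rng = random.Random(0)
train = sample(200, rng)   # data the models are fitted to
fresh = sample(200, rng)   # data the models never saw

models = {
    "too simple (mean)": fit_mean(train),
    "straight line": fit_line(train),
    "memorizer": fit_memorizer(train),
}
scores = {name: (rmse(m, train), rmse(m, fresh)) for name, m in models.items()}

for name, (train_rmse, fresh_rmse) in scores.items():
    print(f"{name:20s} train RMSE = {train_rmse:.2f}   fresh RMSE = {fresh_rmse:.2f}")
```

The memorizer drives its training RMSE to zero, yet on fresh data it should do worse than the straight line, because it has stored the noise along with the signal. The mean-only model is poor on both sets, which is why underfitting is the easier problem to spot.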
S3: Data Splitting
S4: Exploratory Data Analysis
- Data Splitting: required for machine learning
- Three data sets unlock the ability to do machine learning properly: 1) exploratory data, 2) check / test data set, 3) soft interim check, 4) modern
- The exploratory data set is split into training data (to fit the model) and validation data (practice problems or do-overs)
- Exploratory data analysis (EDA)
- Only look at the training data
- Do analytics on the dataset first
- Pair a domain expert with an analyst
- Check the data, then develop algorithms to clean it
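As a minimal sketch of the splitting and exploration steps above: shuffle the rows once, cut them into training, validation, and test sets, and then compute summary statistics on the training rows only. The 60/20/20 proportions, the function names `three_way_split` and `summarize`, and the toy data set are all assumptions chosen for illustration; the notes do not prescribe them.

```python
import random
import statistics


def three_way_split(rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows`, then cut it into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test


def summarize(values):
    """Basic summary of one numeric column, counting missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


# Hypothetical data set: 100 rows, one numeric column, every 10th value missing.
rows = [{"id": i, "value": None if i % 10 == 0 else float(i)} for i in range(100)]

training, validation, test = three_way_split(rows)

# Explore the training rows only; validation and test rows stay unseen.
training_summary = summarize([row["value"] for row in training])
print(len(training), len(validation), len(test))
print(training_summary)
```

Counting missing values during exploration is one way to find out what the cleaning step needs to handle before any model is fitted.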