Machine Learning Lectures: Steps 2-4

“Data is the new science. Big Data holds the answers.”
Pat Gelsinger

S2: Data Engineering

S3: Overfitting Danger

S3: What Is Underfitting

  • Getting Data
  • You are responsible for telling the algorithm what data to use
  • Engineers tell the algorithms what data to look at
  • Split the data
    • avoid driving the training RMSE to its absolute minimum
  • Generalization, not memorization!
  • Overfitting: happens when you model noise instead of reality 
  • You need fresh data for checking performance. 
    • This is why you split your data! 
  • Underfitting: happens when you insist on using a model that is too simple
  • Let your data speak
  • Underfitting: the damage is easy to see
  • Overfitting: harder to detect
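
The points above about splitting, RMSE, overfitting, and underfitting can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not from the lecture: the synthetic sine-plus-noise data, the 150/50 split, and the choice of a k-nearest-neighbour average as the model. With k=1 the model memorizes the training points, with k equal to the training-set size it collapses to a single mean (too simple), and a moderate k sits in between.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

random.seed(0)
# Synthetic data (assumption): a sine curve ("reality") plus Gaussian noise.
xs = [random.uniform(0, 6) for _ in range(200)]
ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]

# Split the data: 150 points to fit, 50 fresh points to check performance.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

results = {}
for k in (1, 15, 150):
    train_pred = [knn_predict(train_x, train_y, x, k) for x in train_x]
    test_pred = [knn_predict(train_x, train_y, x, k) for x in test_x]
    results[k] = (rmse(train_y, train_pred), rmse(test_y, test_pred))
    print(f"k={k:3d}  train RMSE={results[k][0]:.3f}  test RMSE={results[k][1]:.3f}")
```

In this setup, k=1 reaches a training RMSE of zero because every training point is its own nearest neighbour, yet its error on the fresh data is not zero: the training score alone hides the problem. The k=150 model is poor on both sets, which is why underfitting tends to be the easier failure to spot.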

S3: Data Splitting

S4: Exploratory Data Analysis

  • Data Splitting: required for machine learning
  • Three data sets unlock the ability to do machine learning properly: 1) exploratory data, 2) check / test data set, 3) soft interim check, 4) modern
  • The exploratory data set is split into training data (to fit the model) and validation data (to check it: practice problems or do-overs)
  • Exploratory data analysis (EDA)
  • Only look at the training data
  • Do analytics on the dataset first
    • pair a domain expert with an analyst
    • check the data - develop algorithms to clean the data
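
A minimal sketch of the splitting and training-only analysis described above, under stated assumptions: the helper name three_way_split, the 60/20/20 proportions, the fixed seed, and the toy records are all hypothetical and chosen only for illustration. The training and validation lists together play the role of the exploratory data, and the test list is set aside for the final check.

```python
import random
import statistics

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle a copy of rows and cut it into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    training = shuffled[n_test + n_val:]
    return training, validation, test

# Hypothetical records: (feature, target) pairs; None stands in for a missing value.
rows = [(float(i), 2.0 * i + 1.0) for i in range(100)]
rows[7] = (None, 15.0)

training, validation, test = three_way_split(rows)

# Exploratory analysis and cleaning rules look at the training data only.
clean_training = [r for r in training if r[0] is not None]
print("training rows:", len(training), "after cleaning:", len(clean_training))
print("mean feature (training only):", statistics.mean(r[0] for r in clean_training))
```

Splitting before any analysis is the design choice that matters here: summary statistics and cleaning rules derived from the training list never see the validation or test rows, so those rows remain fresh data for checking performance.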