Machine Learning Lectures: Steps 5-7

“Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming.”
Chris Lynch

S5: Algorithm Selection

S6: Is Training AI Easy?

S6: Ideal Dataset Size

  • Algorithm Selection:
    • Support Vector Classifier
      • straight-line boundary
    • Decision Tree
      • horizontal / vertical boundaries
    • Neural Network
      • curvilinear boundary
  • ML Research
    • Finding Patterns
  • ML Application
    • Assessing models 
  • Throw away methods that don't meet your needs
  • Review of steps
  • Training -- Step 6
  • How many features should you use?
    • Worst: many features, few instances
    • Better: only a subset of features
    • Better: more instances
    • Best: far more instances than features
  • Length-to-width ratio (instances to features)
    • rule of thumb: about 10 to 1
  • Dimensionality Reduction / Feature Reduction
  • PCA
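A minimal sketch of PCA as feature reduction, assuming the data is a numpy array of shape (instances, features); the `pca` helper and the toy data are illustrative, not from the lecture:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components.

    X: (n_instances, n_features) array.
    Returns a reduced array of shape (n_instances, n_components).
    """
    # Center each feature at zero mean -- PCA assumes centered data.
    X_centered = X - X.mean(axis=0)
    # SVD: the rows of Vt are the principal directions, ordered by variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # 100 instances, 8 features
X_reduced = pca(X, n_components=2)   # keep the 2 highest-variance directions
print(X_reduced.shape)               # (100, 2)
```

The components come out ordered, so the first column always carries at least as much variance as the second.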

S6: Training Faster

S6: Statistics vs "Statistics"

S6: Overfitting 

  • Use prototyping tools
  • Start with a smaller dataset to see if tinkering is worth doing
  • Statistics: a philosophical pursuit vs. simply trying it and seeing what works
  • Tinker till it fits
  • A model "fits" when it performs well on the objective
  • Make it fit!
  • Training + Tuning
    • fit, overfit, underfit, mess
  • Validation: pass / fail
  • Testing: final check on unseen data
  • Model complexity encourages overfitting
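A quick demonstration of the last bullet: as model complexity (here, polynomial degree) grows, training error keeps dropping, while error on held-out data often gets worse. The data and degrees are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=60)  # noisy target

# Hold out part of the data so overfitting is detectable.
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

train_err, val_err = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    val_err[degree] = mse(coeffs, x_val, y_val)

# Training error can only improve with a richer model (nested bases)...
# ...but validation error for the high-degree fit is often worse.
```

The degree-9 model memorizes training noise; comparing `train_err` and `val_err` side by side is what exposes it.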

S6: Complexity and Regularization

S6: Tempting Features

S6: Skip Training Phase 

  • Regularization: focus on simplicity
  • Penalties for errors and complexity
  • Avoid training on data from the future
  • Avoid training on features that cannot be used in production
  • Your goal is to find patterns in the data 
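One way to see "penalties for errors and complexity" concretely is ridge regression, which adds an L2 penalty on weight size to the error term. A minimal numpy sketch (the `ridge_fit` helper and data are hypothetical):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Linear regression with an L2 complexity penalty (ridge).

    Minimizes ||Xw - y||^2 + alpha * ||w||^2; larger alpha favors simplicity.
    """
    n_features = X.shape[1]
    # Closed-form solution: (X^T X + alpha * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=50)  # only feature 0 matters

w_loose = ridge_fit(X, y, alpha=0.01)   # barely penalized
w_tight = ridge_fit(X, y, alpha=100.0)  # heavily penalized
# The heavier penalty shrinks the weight vector toward zero.
print(np.linalg.norm(w_tight) < np.linalg.norm(w_loose))  # True
```

The penalty trades a little training error for a simpler, more stable model.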

S7: Debugging Model

S7: Hyper-parameter Tuning

S7: What is a Holdout Set?

  • Debugging / Tuning
  • Need its own dataset
    • pre-save or take out of training data
  • Check performance in debugging data
    • what instances model got wrong
  • Look at things that went wrong
    • see if there is something that should be added
  • Tuning: Hyperparameters - numerical settings in an algorithm
  • Hyperparameters are set before the algorithms runs 
  • Parameters are set using the data 
  • Use a "for" loop

S7: Cross Validation

S7: Advanced Debugging

S7: Can You Skip Tuning?

  • Cross-validation: tuning across multiple splits of the data
  • k-fold cross validation
    • k is the number of non-overlapping pieces
    • train, evaluate then store
    • move to the next setting
  • Aggregated performance 
  • Result is tuned model
  • Check model stability
    • debug inside of the model
  • Enough data?
  • Overfitting?
  • Susceptible to outliers?
  • Tuning tends to be more important later in an ML project
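The k-fold recipe above (k non-overlapping pieces; train, evaluate, store; move to the next setting; aggregate) can be sketched as follows; the polynomial-degree hyperparameter and the data are illustrative choices, not from the lecture:

```python
import numpy as np

def k_fold_cv_error(x, y, degree, k=5):
    """Average holdout error of a degree-`degree` polynomial over k folds."""
    folds = np.array_split(np.arange(len(x)), k)  # k non-overlapping pieces
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errors))  # aggregated performance across folds

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)

# Evaluate each hyperparameter setting, then keep the best one.
scores = {d: k_fold_cv_error(x, y, d) for d in (1, 2, 3)}
best = min(scores, key=scores.get)
```

Because every instance serves as validation data exactly once, the aggregated score is more stable than a single holdout split, which also makes it a useful check on model stability.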