Big Data, Predictive Analytics and Deep Learning with Apache Spark

Chris Teplovs, Ph.D.

Day 2

Workshop overview

Day 1:
Focus on data
Introductions to each other, the workshop, Big Data, Spark and Databricks
Day 2:
Focus on techniques
Clustering, classification and analytic pipelines
Day 3:
Focus on the future
Deep Learning, Neural networks, and project presentations

Day 1 (yesterday)

Segment   Topic
1.1       Workshop overview and Introductions
1.2       Introduction to Databricks
1.3       Hands-On: Databricks
1.4       Intro to Spark & DataFrames
1.5       Hands-On: DataFrames
1.6       Big Data Sets
1.7       Hands-On: Exploring Data

Day 2 (today)

Segment   Topic
2.1       Clustering Overview
2.2       k-Means and Bisecting k-Means
2.3       Hands-On: Clustering
2.4       Classification Overview
2.5       Hands-On: Classification
2.6       Model Evaluation and Tuning
2.7       Hands-On: Evaluation and Tuning

(Re)starting your Databricks cluster

  • notice the status of your cluster...
  • we want to start a new cluster
  • Databricks Runtime 4.0; the Python version doesn't matter (!)
  • watch the libraries load (um, wow?)

Clustering

Cluster analysis

  • finds "interesting" groups of objects based on similarity
  • what typically makes a "good" clustering?
    • members are highly similar to each other (i.e. minimize within-cluster distances)
    • clusters are well-separated from each other (i.e. maximize between-cluster distances)

A "good" clustering solution

Applications of Cluster Analysis

  • understanding
    • group related documents for browsing
    • group genes and proteins that have similar functionality
    • group stocks with similar price fluctuations
  • summarization
    • reduce size of large data sets

Cluster Analysis Workflow

  1. Formulate the problem
  2. Select a distance measure (optional)
  3. Select a clustering procedure
  4. Decide on number of clusters
  5. Interpret and profile clusters
  6. Assess validity of clustering

Clustering: useful in exploratory data analysis

  1. Data understanding: finding underlying factors, groups, structure
  2. Data navigation: web search and browsing
  3. Data reduction: create new nominal variables
  4. Data smoothing: infer missing attributes from cluster neighbors

Clustering arises in many fields

  • Health
    • DNA gene expression (e.g. cancer, immunomarkers)
    • Medical imaging
  • Business
    • Market segments
    • Web site visitors
  • Social Network analysis
    • Find communities
  • Information retrieval
    • search results clustered by similarity
    • personalization for groups of similar users
  • Speech understanding
    • convert waveforms to categories

Finding the "best" clustering

how many clusters?

Clustering algorithms

  • hard (objects belong to only 1 cluster) vs. soft (multiple membership)
  • hierarchical vs. non-hierarchical (flat)
  • agglomerative vs. divisive

We will focus on flat and hierarchical divisive methods: k-means and bisecting k-means

FYI: Agglomerative Hierarchical

  • produces a set of nested clusters organized as a hierarchical tree
  • can be visualized as a dendrogram:

k-means clustering

  • partitional ("flat") clustering
  • each cluster has a centroid (center point)
  • each point is assigned to its nearest centroid
  • number of clusters (k) is specified in advance

k-means algorithm
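A minimal NumPy sketch of the classic (Lloyd's) iteration, for intuition only: it omits empty-cluster handling and convergence checks, and Spark's distributed implementation is more careful. The data and variable names are illustrative.

import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1. initialize centroids by sampling k distinct points at random
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

pts = np.array([[1.0, 1.2], [0.8, 1.1], [8.0, 8.3], [8.2, 7.9]])
labels, centers = kmeans(pts, k=2)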

k-means iterations

k-means notables

  • different initializations of centroids can yield different results
  • centroid is typically the mean of the points in the cluster (cf. medoids)
  • proximity can be measured by Euclidean distance, cosine similarity, correlation, Manhattan distance, etc.
  • computationally complex (relatively speaking)

Limitations of k-means

  • k-means has problems when clusters are of differing sizes, densities, or are "oddly" shaped
  • outliers can cause problems

Bisecting k-means

  • hierarchical divisive technique
  • uses k-means with k=2 at each iteration

Bisecting k-means

Loop: until the stopping condition for the number of clusters
      has been reached
    Loop: for every current cluster
        - Measure the total SSE of this (parent) cluster
        - Apply the k-means algorithm to the cluster with k = 2
        - Measure the total SSE of the two child clusters
          compared to their parent
    End Loop
    - Choose the cluster split that gives the lowest error and
      commit that split
End Loop

Relative Merits of Bisecting k-means

  • computationally efficient (k = 2 at each split)
  • resulting clusters tend to be stable

However, it tends to produce different clusters than plain k-means.
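For reference, a hedged sketch of Spark ML's BisectingKMeans; it assumes a DataFrame df that already has a "features" vector column (e.g. built with VectorAssembler):

from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans(k=4, seed=1)   # stop once 4 leaf clusters exist
model = bkm.fit(df)                  # df is assumed to have a "features" column
clustered = model.transform(df)      # adds a "prediction" column
print(model.clusterCenters())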

Clustering in Spark

  • Why Spark?
  • Why not scikit-learn?
  • Why not R?
  • Why focus on k-means and bisecting k-means?
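One short answer to the first question: Spark ML runs the whole clustering workflow on distributed DataFrames, so it scales past a single machine. A minimal, self-contained sketch with a toy dataset (column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()   # already defined as `spark` in Databricks

raw = spark.createDataFrame(
    [(1.0, 1.2), (0.8, 1.1), (8.0, 8.3), (8.2, 7.9)], ["x", "y"])

# Spark ML expects a single vector column of features
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
data = assembler.transform(raw)

model = KMeans(k=2, seed=42).fit(data)
model.transform(data).show()                 # each row gains a "prediction" column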

How many clusters?

  • Theoretical, conceptual or practical issues may suggest a certain number of clusters
  • plot the ratio of total within-group variance to between-group variance against the number of clusters
  • look for an "elbow" in the resulting plot
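A sketch of the elbow computation, reusing the data DataFrame from the earlier k-means sketch; computeCost (within-cluster sum of squared errors) is the Spark 2.x API:

from pyspark.ml.clustering import KMeans

# fit a model for each candidate k and record the within-cluster SSE
costs = [(k, KMeans(k=k, seed=42).fit(data).computeCost(data))
         for k in range(2, 11)]
for k, cost in costs:
    print(k, cost)   # plot cost vs. k and look for the bend (the "elbow")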

How many clusters?

Good clusters?

  • stable across perturbations (e.g. different methods or distance metrics)
  • silhouette score (1 = good, -1 = really bad)

Silhouette plot

Silhouette score = summary of plot
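Spark 2.3+ ships a silhouette-based evaluator. Assuming clustered is a DataFrame with "features" and "prediction" columns (as produced by the sketches above):

from pyspark.ml.evaluation import ClusteringEvaluator

# silhouette with squared Euclidean distance is the default metric
silhouette = ClusteringEvaluator().evaluate(clustered)
print(silhouette)   # near 1 = good separation; near -1 = really bad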

To the notebook!

Break!

Classification

  • classification and classification types
  • algorithms
    • Naive Bayes
    • Decision Tree
    • Random Forest
  • Evaluation: Train/Test and Cross-Validation

Clustering vs. Classification

  • With clustering, we know there is structure (e.g. different types of people), but we don't know what that structure is
  • clustering is unsupervised
  • goal: find the structure
  • usually don't know which things go together in a cluster
  • may or may not know how many clusters
  • usually figure out what clusters mean after the fact

Clustering vs. Classification

  • Classification:
  • often supervised (or semi-supervised)
  • we know the labels of things (e.g. spam vs. non-spam, pop vs. classical)
  • the computer "learns" rule(s) for where to put things
  • we don't know which features are the best predictors of membership
  • usually know which things go together in a class
  • usually know how many classes there are
  • usually know what the classes "mean" in advance

Classification is about...

  • Answering or predicting something (Y) given an input (X):
  • Is X a Y (or not)? (e.g. is this email spam or not?)
  • What group (Y) does X belong to? (e.g. is this a forest or a mountain?)
  • What is the value of Y given X? (e.g. what grade should this student get?)

Classifiers work by...

Being fed labeled examples and learning how important certain features are.

Classification workflow

  1. Generate/obtain labels
  2. Generate/obtain features
  3. Select a classifier
  4. Train classifier
  5. Tune classifier
  6. Test classifier

Getting labels

  • Painful, expensive, time-consuming
  • Often human labor
  • Can infer labels (e.g. predict gender by examining name of author)
  • Synthetic datasets

Features

  • Same ideas as in clustering
  • Some set of "descriptions" for an object: explicit and inferred (calculated)

Clustering vs. Classification

  • Clustering tries to separate groups by using (dis)similarity
  • Classification tries to find important features for distinguishing group membership

Some popular classifiers

  • k-Nearest Neighbor (kNN)
  • Logistic Regression
  • Naive Bayes
  • Decision Tree
  • Random forest
  • SVM
  • Neural Networks ("deep learning")

Our focus: Decision Trees and Random Forests

Decision Trees

I'm thinking of an animal: ask me some yes/no questions.

Decision Trees

  • Ask the question with the most valuable answer first
  • "If I knew the answer to this, how much closer to the solution would I be?"
  • Solutions that divide the space 50/50 are better than solutions that divide the space 98/2
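One way to make "valuable answer" concrete: measure the entropy of the answer in bits, which peaks when the outcomes are balanced. A tiny worked example:

import math

def entropy(p):
    # entropy (in bits) of a yes/no answer with P(yes) = p
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(entropy(0.50))   # 1.00 bit  -- a 50/50 question is maximally informative
print(entropy(0.98))   # ~0.14 bit -- a 98/2 answer is almost a foregone conclusion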

Decision Tree Advantages

  • easy to interpret
  • prediction process is obvious
  • can handle mixed data types

Decision Tree Limitations

  • expensive to calculate
  • tendency to overfit
  • can get large

Random Forest

  • currently a favorite technique
  • can fix the problems of a single tree (e.g. overfitting) by using many
  • various ways to randomize: pick different subsets of data, pick different features
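A hedged Spark ML sketch showing those randomization knobs; it assumes a train DataFrame with "label" and "features" columns:

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    numTrees=100,                  # many randomized trees...
    subsamplingRate=0.8,           # ...each fit on a different subset of rows
    featureSubsetStrategy="sqrt",  # ...considering a random subset of features per split
    labelCol="label", featuresCol="features")
model = rf.fit(train)
print(model.featureImportances)    # importance aggregated across the forest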

How do you visualize a random forest?

Classification Summary

  • ubiquitous
  • possibly dangerous
  • good for when we know something about the structure
  • same type of "pipeline" as clustering
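To make the shared "pipeline" idea concrete, here is a sketch combining feature assembly, a train/test split, and a small cross-validated grid search; the DataFrame raw and its column names are assumptions for illustration:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, tree])

train, test = raw.randomSplit([0.8, 0.2], seed=42)  # hold out a test set

grid = ParamGridBuilder().addGrid(tree.maxDepth, [3, 5, 7]).build()
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

model = cv.fit(train)                            # tuned via 3-fold CV on train
print(evaluator.evaluate(model.transform(test))) # final accuracy on held-out test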

To the notebook!

Workshop Overview

Day 1

Day 2

Day 3