This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
| Segment | Topic |
|---|---|
| 1.1 | Workshop overview and introductions |
| 1.2 | Introduction to Databricks |
| 1.3 | Hands-On: Databricks |
| 1.4 | Intro to Spark & DataFrames |
| 1.5 | Hands-On: DataFrames |
| 1.6 | Big Data Sets |
| 1.7 | Hands-On: Exploring Data |
Import the following URL: https://goo.gl/xoT2R7
Note: see also Zeppelin notebooks.
A different way of thinking about solving problems: instead of computing one long sum sequentially, split the work into chunks, sum each chunk independently (possibly on different machines), then combine the partial results.

32 + 12 + 23 + 4 + 7 + 21 + 19 + 32 + 3 + 11 + 88 + 23 + 1 + 93 + 5 + 28 = ?

32 + 12 + 23 + 4 + 7 + 21 = ?
19 + 32 + 3 + 11 + 88 = ?
23 + 1 + 93 + 5 + 28 = ?
------
?
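The split-compute-combine idea above can be sketched in plain Python (no Spark needed); the chunks below match the slide's 6/5/5 split:

```python
from functools import reduce

numbers = [32, 12, 23, 4, 7, 21, 19, 32, 3, 11, 88, 23, 1, 93, 5, 28]

# Split the list into chunks that could be summed on different machines
chunks = [numbers[:6], numbers[6:11], numbers[11:]]

# "Map": sum each chunk independently
partial_sums = [sum(chunk) for chunk in chunks]

# "Reduce": combine the partial results into the final answer
total = reduce(lambda a, b: a + b, partial_sums)
print(partial_sums, total)  # [99, 153, 150] 402
```

The point is that the chunk sums are independent of each other, so they can run in parallel; only the final combine step needs all of them.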
Another classic example is word count. Given the input below, how many times does each word appear?

Deer Bear Car Car Car River Deer Car Bear
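This is the canonical map/reduce word-count pattern: map each word to a count of 1, then reduce by key. A minimal sketch in plain Python (Spark performs the same computation at scale):

```python
from collections import Counter

text = "Deer Bear Car Car Car River Deer Car Bear"

# "Map": emit one count per word; "Reduce": sum the counts per word.
# Counter does both steps in one pass.
counts = Counter(text.split())
print(dict(counts))  # {'Deer': 2, 'Bear': 2, 'Car': 4, 'River': 1}
```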
Once you have a SparkSession (reminder: the variable `spark` in Databricks is a SparkSession), a DataFrame can be created from a variety of sources:
If using a list of tuples, include a list of column names; if using a list of values, specify value type:
```python
df_from_other_list = spark.createDataFrame(
    [('Chris', 67), ('Frank', 70)], ['name', 'score'])
df_from_other_list.show()
```

```python
from pyspark.sql.types import FloatType

df_from_list = spark.createDataFrame(
    [1.0, 2.0, 3.0, 4.0, 5.0], FloatType())
df_from_list.show()
```
```python
# read a specially formatted JSON file (one JSON object per line)
df = spark.read.json("/mnt/umsi-data-science/data/yelp/business.json")

# Displays the content of the DataFrame to stdout
df.show()

# Displays the schema in a tree format
df.printSchema()
```
```
root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: boolean (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: struct (nullable = true)
 |    |    |-- casual: boolean (nullable = true)
 |    |    |-- classy: boolean (nullable = true)
 |    |    |-- divey: boolean (nullable = true)
 |    |    |-- hipster: boolean (nullable = true)
 |    |    |-- intimate: boolean (nullable = true)
 |    |    |-- romantic: boolean (nullable = true)
 |    |    |-- touristy: boolean (nullable = true)
 |    |    |-- trendy: boolean (nullable = true)
 |    |    |-- upscale: boolean (nullable = true)
 |    |-- BYOB: boolean (nullable = true)
```
Spark can load a number of different formats: `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`
```python
df = spark.read.load("examples/src/main/resources/people.json",
                     format="json")
```
```python
df.select("name").show()
```

```
+--------------------+
|                name|
+--------------------+
|    Dental by Design|
| Stephen Szabo Salon|
|Western Motor Veh...|
|    Sports Authority|
|Brick House Taver...|
|             Messina|
...
only showing top 20 rows
```
```python
# Select businesses with 4 or more stars
df.filter(df['stars'] >= 4).show()

# Count businesses by stars
df.groupBy("stars").count().show()

# Count businesses by stars and sort the output
df.groupBy("stars").count().sort("stars", ascending=False).show()
```
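As a mental model for what `groupBy("stars").count().sort(...)` computes, here is the same group-count-sort over a small, hypothetical list of star ratings in plain Python:

```python
from collections import Counter

# Hypothetical ratings; in Spark these would be the 'stars' column
stars = [4.0, 3.5, 4.0, 5.0, 3.5, 4.0, 2.0]

# Equivalent of groupBy("stars").count().sort("stars", ascending=False):
# count occurrences per distinct value, then sort by the value, descending
counts = sorted(Counter(stars).items(), key=lambda kv: kv[0], reverse=True)
print(counts)  # [(5.0, 1), (4.0, 3), (3.5, 2), (2.0, 1)]
```

The difference is that Spark distributes the counting across the cluster before combining the per-partition counts.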
```python
df = spark.createDataFrame([('Chris', [67, 42]), ('Frank', [70, 72])],
                           ['name', 'scores'])
df.show()
```

```
+-----+--------+
| name|  scores|
+-----+--------+
|Chris|[67, 42]|
|Frank|[70, 72]|
+-----+--------+
```
```python
from pyspark.sql.functions import explode

# explode() creates one output row per element of the array column
df = df.withColumn('score', explode('scores'))
df.show()
```

```
+-----+--------+-----+
| name|  scores|score|
+-----+--------+-----+
|Chris|[67, 42]|   67|
|Chris|[67, 42]|   42|
|Frank|[70, 72]|   70|
|Frank|[70, 72]|   72|
+-----+--------+-----+
```
```python
import pyspark.sql.functions as F

# Flag scores above 50 with 1, everything else with 0
df.withColumn('good', F.when(df['score'] > 50, 1).otherwise(0)).show()
```
```python
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("businesses")

sqlDF = spark.sql("SELECT * FROM businesses")
sqlDF.show()
```
```
SELECT [Columns] FROM [Tables]
WHERE [Filter Condition]
ORDER BY [Sort Columns]
```
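The same SELECT/WHERE/ORDER BY template works in any SQL engine, not just Spark. A small sketch using Python's built-in sqlite3 with made-up businesses (in Databricks you would pass the same query string to `spark.sql(...)` against the temp view above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE businesses (name TEXT, stars REAL)")
conn.executemany("INSERT INTO businesses VALUES (?, ?)",
                 [("Messina", 4.5), ("Sports Authority", 3.0),
                  ("Dental by Design", 4.0)])

# SELECT [Columns] FROM [Tables] WHERE [Filter] ORDER BY [Sort Columns]
rows = conn.execute(
    "SELECT name, stars FROM businesses "
    "WHERE stars >= 4 ORDER BY stars DESC").fetchall()
print(rows)  # [('Messina', 4.5), ('Dental by Design', 4.0)]
```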
| Name | Description | Size |
|---|---|---|
| Bob Ross images | A collection of tags and images from the famed painter | 438 images |
| Spam email | Email messages that have been labelled as spam (or not) | ? |
| Yelp | Businesses, reviews, and user data | ~20 GB |
Work in your teams to explore one of the data sets above.