MapReduce:
- allow computations to be parallelized over a cluster
- basic map-reduce
- the MapReduce framework schedules map tasks to run on the correct partitions and shuffles data for the reduce function
- Map:
- Apply a function to each element over a portion (partition) of the data, in parallel
map (mapping data to CPUs (workers))
Lambda function:
lambda x : x + 2
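As a plain-Python illustration (not the Spark API), the lambda above can be applied element-wise with the built-in map:

```python
# Plain Python, not Spark: apply the lambda element-wise with map()
add_two = lambda x: x + 2
data = [1, 2, 3, 4]
result = list(map(add_two, data))
# result == [3, 4, 5, 6]
```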
Check the number of partitions
rdd.getNumPartitions()
initialize
sc.textFile("file", number of partitions)
sc.parallelize(list/array, number of partitions)
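A minimal plain-Python sketch (not the Spark API) of what parallelize-style partitioning does conceptually: split a list into roughly equal chunks, one chunk per partition.

```python
# Plain-Python sketch (not Spark): split data into n partitions
def partition(data, n):
    size, extra = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        # early partitions absorb the remainder, one extra element each
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

parts = partition(list(range(10)), 3)
# parts == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
# len(parts) plays the role of rdd.getNumPartitions()
```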
Transformation
- Perform functions against each element in an RDD and return a new RDD
- Construct a new RDD from existing RDD
- Doesn’t change the original RDD
- can change the structure of the data, producing a new RDD
- Narrow Transformation:
- Transform data without any shuffle (each task only touches its own partition)
- Wide Transformation:
- when the operations may require shuffling data (need to look across all partitions)
- Construct a new RDD from existing RDD
- lazy evaluation: transformations are only evaluated when an action is requested
- Spark tracks the lineage (chain) of transformations that have happened
map(func): apply func to each element of the RDD
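A plain-Python sketch (not Spark) contrasting a narrow transformation (per-partition map, no data movement) with a wide one (grouping by key, which needs data from every partition):

```python
from collections import defaultdict

# Plain-Python sketch (not Spark): narrow vs wide transformations
partitions = [[1, 2, 3], [4, 5], [6, 7, 8]]

# Narrow: each partition is transformed independently, no shuffle
doubled = [[x * 2 for x in part] for part in partitions]

# Wide: grouping by key (here: even/odd) must see every partition,
# so elements are pulled together across partitions -- the "shuffle"
groups = defaultdict(list)
for part in partitions:
    for x in part:
        groups[x % 2].append(x)
# groups[0] == [2, 4, 6, 8], groups[1] == [1, 3, 5, 7]
```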
Action:
- Trigger a computation and return a value to the Spark driver
- aggregate(zeroValue, seqOp, combOp)
- fold
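A plain-Python sketch of aggregate's semantics (assuming the usual Spark behaviour): seqOp folds elements within each partition starting from zeroValue, then combOp merges the per-partition results.

```python
from functools import reduce

# Plain-Python sketch (not Spark) of aggregate(zeroValue, seqOp, combOp)
def aggregate(partitions, zero, seq_op, comb_op):
    per_part = [reduce(seq_op, part, zero) for part in partitions]  # seqOp within each partition
    return reduce(comb_op, per_part, zero)                          # combOp across partitions

# Sum and count in one pass: zeroValue = (0, 0)
partitions = [[1, 2, 3], [4, 5]]
total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: fold one element in
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp: merge partition results
)
# total == 15, count == 5
```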
Pair RDDs:
- key-value pairs, commonly used for many operations in Spark, including aggregations and ETL (Extract, Transform, Load)
- Allow operations on each key in parallel, or regrouping data across the network, e.g. reduceByKey()
Create a pair RDD: map(lambda x: x.split(','))
- Key: can be a simple object (integer, string, etc.) or a complex object (tuple, etc.)
- Value: can be a simple object or a data structure (list, tuple, dictionary, set, etc.)
Load:
map(lambda x: (int(x[0]), x[1]))
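As a plain-Python sketch of the same parsing (assuming comma-separated "id,name" lines):

```python
# Plain Python (not Spark): parse CSV lines into (int key, value) pairs
lines = ["1,apple", "2,banana", "3,cherry"]
fields = [line.split(",") for line in lines]   # like map(lambda x: x.split(','))
pairs = [(int(x[0]), x[1]) for x in fields]    # like map(lambda x: (int(x[0]), x[1]))
# pairs == [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
```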
Operations in Pair RDDs:
- Transformations:
- keys() — return an RDD of just the keys
- values() - return an RDD of just the values
- sortByKey() - return an RDD sorted by the key
- groupByKey() — shuffle
- group data by key
- Return an RDD of (key, ResultIterable) pairs
- mapValues(func)
- Pass each value in the key-value pair RDD through a map function without changing the keys
- Retains the original RDD's partitioning
- flatMapValues(func)
- Pass each value through a flatMap function (flattening the values) without changing the keys
- Retains the original RDD's partitioning (no shuffle needed)
- reduceByKey(lambda x, y: x + y) — shuffle
- Merge the values for each key with an associative reduce function
- sortBy() — shuffles the data
- reduceByKey() is usually faster than groupByKey() because it combines values within each partition before the shuffle, so less data moves across the network
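The performance point above can be sketched in plain Python (not Spark): a reduceByKey-style operation combines values inside each partition first, so fewer records cross the network than with a groupByKey-style operation.

```python
from collections import defaultdict

# Plain-Python sketch (not Spark): why reduceByKey beats groupByKey
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

# groupByKey-style: every record is shuffled, then grouped (6 records moved)
shuffled = [kv for part in partitions for kv in part]

# reduceByKey-style: combine locally per partition before the shuffle
def local_combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

combined = [local_combine(part) for part in partitions]  # only 4 records moved
totals = defaultdict(int)
for part in combined:
    for k, v in part:
        totals[k] += v
# totals == {'a': 7, 'b': 14}
```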
- Actions:
- countByKey()
- Count the number of elements for each key
- lookup(key)
- Return all values associated with the given key
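A plain-Python sketch (not Spark) of what these two actions compute:

```python
from collections import Counter

# Plain Python (not Spark): countByKey and lookup semantics on pairs
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# countByKey: number of elements per key
count_by_key = Counter(k for k, _ in pairs)
# count_by_key == {'a': 3, 'b': 1}

def lookup(pairs, key):
    # like rdd.lookup(key): all values whose key matches
    return [v for k, v in pairs if k == key]
```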