Environment:
- Keep dependencies required by different projects in separate places.
- Easily switch to an environment that requires different dependencies.
Create an environment:
conda create --name myenv -y # create an empty environment named myenv
conda env create -f file.yml -n myenv # create an environment named myenv from a yml file (sample file below)
conda env update -f file.yml -n myenv # update myenv based on file.yml
conda activate myenv # activate the environment
conda info --envs # list available envs
conda deactivate # deactivate the current environment and return to the base environment
conda env remove --name myenv # remove the environment (deactivate it first)
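The file.yml referenced above is a conda environment file. A minimal sketch of what it might contain (the channel, package names, and versions here are illustrative assumptions, not from the original notes):
```
# file.yml: an illustrative conda environment file (packages/versions are assumptions)
name: myenv
channels:
  - defaults
dependencies:
  - python=3.10   # pin the Python version
  - numpy         # packages the project needs
  - pip
```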
MapReduce
Map-Reduce: Allows computations to be parallelized over a cluster.
- Basic Map-Reduce
    - Distribution: Distribute the data.
    - Parallelism: Perform subsets of the computation simultaneously.
    - Fault Tolerance: Handle component failure.
- The MapReduce framework schedules tasks to run on the correct partitions and shuffles data for the reduce function.
- Map: Apply a function to each element over a portion of the data, in parallel.
- Reduce: Combine multiple values into a single value (see the sketch below).
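To make the two phases concrete, a plain-Python sketch of word count (the input lines are made up, and the explicit shuffle step is shown here only for illustration; a real framework does the grouping for you):
```
from functools import reduce
from collections import defaultdict

lines = ["spark map reduce", "map reduce"]  # toy input (assumed data)

# Map phase: apply a function to each line, emitting (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key, as the framework does between map and reduce
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: combine the multiple values for each key into one value
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}
print(counts)  # {'spark': 1, 'map': 2, 'reduce': 2}
```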
Hadoop MapReduce vs. Spark
Spark extends the MapReduce model with primitives for efficient data sharing, using Resilient Distributed Datasets (RDDs).
input = sc.textFile("../data...", 8) # read a text file into an RDD with 8 partitions
rdd = sc.parallelize(data) # distribute a local collection as an RDD
Calling input.collect() returns the entire dataset as one flat list; input.glom().collect() instead returns one list per partition, which shows how the data is distributed.
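A small runnable sketch of that difference (assuming a local SparkContext; the app name and toy data are made up):
```
from pyspark import SparkContext

sc = SparkContext("local[2]", "glom-demo")  # assumed local context
rdd = sc.parallelize([1, 2, 3, 4], 2)       # RDD with 2 partitions

print(rdd.collect())         # [1, 2, 3, 4]     : one flat list
print(rdd.glom().collect())  # [[1, 2], [3, 4]] : one list per partition
```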
RDD operations:
Two types:
◦ Transformation
◦ Applies a function against each element in an RDD and returns a new RDD; constructs a new RDD from an existing RDD and doesn't change the original RDD.
```
splitted_input_rdd = input_rdd.map(lambda x: x.split('\t'))  # returns a new RDD; input_rdd is unchanged
```
◦ Lazy evaluation – operations are only evaluated when an action is requested.
◦ Action
◦ Triggers a computation and returns a value to the Spark driver.
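A short sketch of the transformation/action split and the lazy evaluation it implies (reusing the sc from the sketch above; the variable names are illustrative):
```
# Transformation: builds a new RDD description; no computation happens yet
doubled = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2)

# Actions: trigger the actual computation and return a value to the driver
print(doubled.count())    # 4
print(doubled.collect())  # [2, 4, 6, 8]
```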