Environment:
- Keep dependencies required by different projects in separate places.
- Easily switch to an environment that requires different dependencies.
Create an environment:
conda create --name myenv -y # create an empty environment named myenv
conda env create -f file.yml -n myenv # create an environment named myenv from a yml file (sample file below)
conda env update -f file.yml -n myenv # update myenv based on file.yml
conda activate myenv # activate the environment
conda info --envs # list available envs
conda deactivate # deactivate the current environment and return to the base environment
conda env remove --name myenv # remove the environment (deactivate it first)
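The file.yml referenced above is a conda environment file. A minimal sketch of what it might contain (the channel, package names, and versions here are illustrative assumptions, not from the original notes):
```
# file.yml: an illustrative conda environment file (packages/versions are assumptions)
name: myenv
channels:
  - defaults
dependencies:
  - python=3.10   # pin the Python version
  - numpy         # packages the project needs
  - pip
```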
MapReduce
Map-Reduce: Allows computations to be parallelized over a cluster.
- Basic Map-Reduce
    - Distribution: Distribute the data.
    - Parallelism: Perform subsets of the computation simultaneously.
    - Fault Tolerance: Handle component failure.
- The MapReduce framework schedules tasks to run on the correct partitions and shuffles data for the reduce function.
- Map: Apply a function to each element over a portion of the data, in parallel.
- Reduce: Combine multiple values into a single value (see the sketch below).
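To make the two phases concrete, a plain-Python sketch of word count (the input lines are made up, and the explicit shuffle step is shown here only for illustration; a real framework does the grouping for you):
```
from functools import reduce
from collections import defaultdict

lines = ["spark map reduce", "map reduce"]  # toy input (assumed data)

# Map phase: apply a function to each line, emitting (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key, as the framework does between map and reduce
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: combine the multiple values for each key into one value
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}
print(counts)  # {'spark': 1, 'map': 2, 'reduce': 2}
```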
Hadoop MapReduce vs. Spark
Spark extends the MapReduce model with primitives for efficient data sharing, using Resilient Distributed Datasets (RDDs).
input = sc.textFile("../data...", 8) # read a text file into an RDD with 8 partitions
rdd = sc.parallelize(data) # distribute a local collection as an RDD
Calling input.collect() returns the entire dataset as one flat list; input.glom().collect() instead returns one list per partition, which shows how the data is distributed.
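A small runnable sketch of that difference (assuming a local SparkContext; the app name and toy data are made up):
```
from pyspark import SparkContext

sc = SparkContext("local[2]", "glom-demo")  # assumed local context
rdd = sc.parallelize([1, 2, 3, 4], 2)       # RDD with 2 partitions

print(rdd.collect())         # [1, 2, 3, 4]     : one flat list
print(rdd.glom().collect())  # [[1, 2], [3, 4]] : one list per partition
```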
RDD operations:
Two types:
◦ Transformation
◦ Applies a function against each element in an RDD and returns a new RDD; constructs a new RDD from an existing RDD and doesn't change the original RDD.
```
splitted_input_rdd = input_rdd.map(lambda x: x.split('\t'))  # returns a new RDD; input_rdd is unchanged
```
◦ Lazy evaluation – operations are only evaluated when an action is requested.
◦ Action
◦ Triggers a computation and returns a value to the Spark driver.
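A short sketch of the transformation/action split and the lazy evaluation it implies (reusing the sc from the sketch above; the variable names are illustrative):
```
# Transformation: builds a new RDD description; no computation happens yet
doubled = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2)

# Actions: trigger the actual computation and return a value to the driver
print(doubled.count())    # 4
print(doubled.collect())  # [2, 4, 6, 8]
```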