Distributed Computing Main
Setup:
conda env list (show all conda environments and which one is active)
conda env export --no-builds > name.yml (export the environment to a file so others can reproduce the current environment, including all packages with their corresponding versions)
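To recreate the environment on another machine from the exported file (a sketch, assuming it was saved as name.yml as above):
conda env create -f name.yml   # builds a new environment with the same packages and versions
conda activate <env-name>      # the environment name comes from the "name:" field inside name.yml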
Big Data:
Data whose size and structure are beyond what traditional data-processing software can adequately handle.
Example domains:
- internet of things
- health care
- marketing
- Size: a constantly moving target, ranging from terabytes up to petabytes and beyond
- Structure: mostly unstructured. Storing and processing big data became an important issue: data needs to be processed faster and given more structure
Definition:
To process large volumes of data fast:
- “Scale out” instead of scale up: add more machines rather than a bigger machine.
- Cheaper: run large workloads on clusters of many smaller, cheaper machines.
- Reliable (fault tolerant): if one node or process fails, its workload should be taken over by other components in the system.
- Faster: computation is parallelized and distributed; the number of threads that can run across the cores determines how many partitions (the unit of work) can be processed at once (see the sketch after this list).
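A minimal PySpark sketch of scale-out parallelism (assumes pyspark is installed; the data and partition count are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("scale-out-demo").getOrCreate()
sc = spark.sparkContext

# split the data into 8 partitions; each partition is a unit of work that one thread/core processes
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).sum()  # the map runs in parallel, one task per partition

print(total)
spark.stop()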
What is Distributed Computing?
Why Distributed Computing?
Spark will utilize all the cores available on the machine.
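A quick way to verify this locally (a sketch; assumes pyspark is installed):
from pyspark.sql import SparkSession

# "local[*]" asks Spark to use as many worker threads as there are logical cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.sparkContext.defaultParallelism)  # usually equals the number of logical cores
spark.stop()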