Distributed Computing Main


Setup:

conda env list (list the available conda environments)
conda env export --no-builds > name.yml (export the environment file to share and reproduce the current environment, including all packages with their corresponding versions)
conda env create -f name.yml (recreate the environment from the exported file)

Big Data:

Data whose size and structure are beyond what traditional data-processing application software can adequately handle:

Example domains:

  • internet of things
  • health care
  • marketing
  • Size: a constantly moving target (as of the early 2020s, ranging upward from terabytes)
  • Structure: mostly unstructured; storing and processing big data has therefore become an important problem, since data must be processed faster and given more structure

Definition:

To process large volumes of data fast, a system must be:

  • “Scale out” instead of scale up: add more machines rather than a bigger machine.
  • Cheaper: run large workloads on clusters of many smaller, cheaper machines.
  • Reliable (fault tolerant): if one node or process fails, its workload is taken over by other components in the system.
  • Faster: computation is parallelized and distributed; the number of threads that can run on each core determines how many partitions (units of work) can run concurrently.
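The partition idea above can be sketched with the Python standard library (threads on one machine stand in for a cluster; `split_into_partitions` and `process_partition` are illustrative names, not a Spark API):

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Split data into n roughly equal partitions (units of work)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    # Each partition is processed independently, so partitions
    # can run in parallel on different threads/cores/nodes.
    return sum(part)

data = list(range(1_000_000))
partitions = split_into_partitions(data, 8)

with ThreadPoolExecutor(max_workers=8) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# Combine the per-partition results into the final answer.
total = sum(partial_sums)
```

A real framework like Spark does the same split/process/combine, but across processes and machines rather than threads, and reassigns a partition to another worker if its node fails.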

What is Distributed Computing?

A distributed system is a collection of independent machines that communicate over a network and coordinate to act as a single system.

Why Distributed Computing?

A single machine can only scale up so far; distributing work across a cluster gives cheaper scaling, fault tolerance, and parallel speedup.

Spark will utilize all available cores on each machine.


Main:

Spark (see the official Spark documentation)


Key words:


TAGS