Deep-dive into Spark internals and architecture

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. A Spark application is a JVM process that runs user code, using Spark as a third-party library.

In this blog, I will show how Spark works on the YARN architecture with an example, and walk through the underlying background processes involved:

- Spark Context
- YARN Resource Manager, Application Master, and launching of executors (containers)
- Setting up environment variables and job resources
- Coarse Grained Executor Backend and Netty-based RPC
- Spark listeners
- Execution of a job (logical plan, physical plan)
- Spark Web UI

Spark Context

The Spark context is the main entry point and the heart of any Spark application. spark-shell is a Scala-based REPL, shipped with the Spark binaries, that creates a Spark context object named sc. We can launch the Spark shell as shown below:

    spark-shell --master yarn \
      --conf spark.ui.port=12345 \
      --num-executors 3
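
Once the shell is up, a quick way to confirm that the launch options took effect is to inspect the sc object the REPL created. The snippet below is a minimal sketch, assuming it is pasted into the spark-shell session started with the command above; the variable names (uiPort, numExecutors, total) are illustrative and not part of Spark, and the values you see depend on your cluster.

```scala
// Paste into the running spark-shell session; `sc` is the SparkContext
// that the shell created at startup (it is not defined in plain Scala).

// Master URL and the application id assigned to this session.
println("master         = " + sc.master)
println("application id = " + sc.applicationId)

// Settings passed on the command line end up in the SparkConf.
// getOption returns None instead of throwing if a key was never set.
val uiPort = sc.getConf.getOption("spark.ui.port")
// Assumption: --num-executors is surfaced as spark.executor.instances.
val numExecutors = sc.getConf.getOption("spark.executor.instances")
println("ui port        = " + uiPort)
println("executors      = " + numExecutors)

// A tiny job to check that executors actually run tasks:
// distribute the numbers 1..100 and add them up.
val total = sc.parallelize(1 to 100).reduce(_ + _)
println("sum of 1..100  = " + total)
```

Running a small job like the last one is also a convenient way to generate activity that shows up in the Spark Web UI on the port configured above.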