Note: This course is built on top of the “Real World Vagrant - Build an Apache Spark Development Env - Toyin Akin” course. If you do not already have a Spark environment installed (within a VM or directly on your machine), you can take the course stated above first.

Spark's Python shell provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. Start it by running the following anywhere in a bash terminal within the built Virtual Machine:

pyspark

Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from collections, from Hadoop InputFormats (such as HDFS files), or by transforming other RDDs (a short example session is sketched at the end of this note).

Spark Monitoring and Instrumentation

While creating RDDs, performing transformations and executing actions, you will be working heavily within the monitoring view of the Web UI. Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:

A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information
Information about the running executors

Why Apache Spark?

Apache Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Apache Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python and R shells. Apache Spark can combine SQL, streaming, and complex analytics. Apache Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
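
To make the RDD ideas above concrete, here is a minimal sketch of an interactive pyspark session. It assumes you have started the shell with the pyspark command as described above, so the SparkContext is already available as sc; the HDFS path in the commented line is a hypothetical placeholder, not part of the course material.

# Create an RDD from a Python collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD from a file instead (e.g. a text file on HDFS or the local filesystem)
# lines = sc.textFile("hdfs:///user/vagrant/data/sample.txt")  # hypothetical path

# Transformations are lazy: they define new RDDs without computing anything yet
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation on the cluster
print(evens.collect())   # [4, 16]
print(squares.count())   # 5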
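
While the shell is running, each action you execute shows up as a job in the Web UI. Assuming the default configuration, you can browse to http://localhost:4040 inside the Virtual Machine (or to the VM's forwarded port from your host, if one is configured) to explore the scheduler stages and tasks, RDD sizes and memory usage, environment information, and running executors described above.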