Data Science with Spark

This course provides an understanding of the Spark framework, of RDDs (the core data structure of Spark), and of how to use them to transform data. Participants will learn how data is stored and processed in a distributed manner. They will also learn the data science process, key machine learning techniques, and how to choose the right technique for their use case. The course also covers understanding and applying the MLlib library to implement machine learning models.

Course Objectives

The following are the objectives of the course:

  • Overview of HDFS and YARN
  • What is Spark
  • How to use PySpark Shell and Jupyter Notebook
  • What RDDs are and how to operate on them
  • Distributed execution on Spark
  • Understand various data formats and how to choose among them
  • DataFrames and how to use them
  • Understand machine learning techniques such as regression, clustering, classification, ALS
  • How to use Spark MLlib to build machine learning models
  • Comparison with other options such as R, Python and Scala


The prerequisites for the course are as follows:

  • Understanding of Linux Commands
  • Basic programming in Python
  • Functional Programming Experience is a plus 
  • Knowledge of Big Data technologies is a plus


This course is suited for those who want to learn Spark and how to use it for data science, apply machine learning models, and understand the various libraries available for data science.

System Requirements

The system requirements are as follows:

  • 64-bit Core 2 Duo processor or above
  • Minimum 8 GB RAM
  • At least 20 GB of free hard disk space
  • USB port enabled
  • Latest version of VMware Player (on non-Mac) or VMware Fusion (on Mac)