Big data and analytics are fast becoming must-have skills for companies all across the world. The technology is valued for the efficiency gained by harnessing data and using it for business decisions on operations, cost-saving initiatives, customer service, and profitability. Such big data technologies include Hadoop, Apache Spark, machine learning and data mining.
Recruitment of professionally trained big data analysts has been a big challenge for human resource experts across the world. Professionals with the right skills and certification are rare and hard to come by.
Among the Apache Spark certifications that you should acquire if you are looking to increase your skills in big data and analytics is the CCA175 certification. This certification will also give you an advantage in the employment market or as a big data consultant.
Essentially, the following steps are necessary to prepare for and pass the CCA175 Spark and Hadoop Developer exam:
- Read everything you can about Spark & Hadoop
- Have a good understanding of how to execute HDFS Commands
- Learn how to move data between relational databases & HDFS using Sqoop
- Choose a programming language between Python and Scala
- Polish up on your SQL (HiveQL) skills
- Develop Spark-based applications using core APIs
- Integrate Spark SQL & data frames to Spark-based applications
- Learn how to stream data pipelines – Flume, Kafka and Spark Structured Streaming
- Take Practice Tests
The CCA175 Spark and Hadoop Developer certification exam
The CCA175 Certification is conducted by Cloudera and involves an exam on a variety of topics including Impala, Avro, Flume, HDFS, Spark with Scala and Python. The CCA175 exam is scenario-based, where you will have 2 hours to answer between 8 and 12 scenario questions, using tools like Impala or Hive, usually with some coding required.
In evaluating your score, Cloudera will look at your results and not the code itself. A minimum score of 70% is required to earn the certification. You should expect your results within 3 days of the exam, and your certificate about a week after.
The CCA175 exam is available in either the Scala or the Python programming language. It is a practical, hands-on exam that is administered remotely to all registered candidates and can be taken anywhere in the world in any of the available time slots. You will be able to see the available time slots when registering for the exam.
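As a rough planning aid, the figures above can be turned into a time budget per question and a pass threshold. This is only a sketch: it assumes that every question carries equal weight, which is not stated here, and the function names are illustrative.

```python
# Rough planning arithmetic for a 2-hour exam with 8 to 12 questions.
# Assumption: all questions are weighted equally (not confirmed by Cloudera).

EXAM_MINUTES = 120
PASS_PERCENT = 70


def minutes_per_question(num_questions):
    """Average time available per question, in minutes."""
    return EXAM_MINUTES / num_questions


def questions_needed(num_questions):
    """Smallest number of correct answers that reaches the pass mark."""
    # Integer ceiling of num_questions * 70 / 100, avoiding float rounding.
    return (num_questions * PASS_PERCENT + 99) // 100


for n in (8, 10, 12):
    print(n, "questions:", minutes_per_question(n), "min each,",
          questions_needed(n), "correct answers needed")
```

Under that assumption, a 10-question exam leaves about 12 minutes per question and allows at most 3 wrong answers.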
Preparing for the exam
This article is meant to be a preparatory guide before taking the CCA175 Spark and Hadoop Developer Certification exam. As this exam tests your hands-on proficiency with several tools and languages, it will be important that you prepare well for it. I shall walk you through the objectives of the test, the skills outline and the resources that you need to stock up on. Taking any certification exam costs time and money. Once you have decided to take the exam, you should aim to pass at the first sitting.
You should strive to learn effectively, and avoid time-wasting learning that will neither improve your skills nor your chances of excelling in this exam. This guide is expected to steer you on the right path, to ensure you get the best chance of passing the exam.
I recommend using ITversity’s courses on Udemy. I have included links to some of these courses in this article. You will be pleasantly surprised to find out how they have packed almost everything you will need to know into these courses to ensure that you are properly equipped to tackle the CCA175 exam.
You need to enhance your proficiency in coding using some key languages, relevant to this exam. You should, therefore, take time to revise and upgrade your skills where necessary.
To take this exam you should have the following key programming skills:
- Sqoop – It is one of the Apache foundation projects, used for efficiently transferring bulk data between Hadoop and relational databases. It can import data from a relational database into Hadoop and export it in the reverse direction. You should, therefore, ensure you have a good understanding of how to use Sqoop.
- Spark – This is the main focus of the exam. You should understand Spark well and be able to write Spark code in Python or Scala.
- HiveQL or SQL – You should be confident about your ability to use Hive or SQL, and be able to write such scripts with relative ease.
Learning objectives and Skills Outline
Cloudera has listed three required skills for the CCA175 certification on its website. They are summarized here:
Transform, Stage, and Store: Convert a set of data values in a given format stored in HDFS into new data values or a new data format and write them into HDFS.
1. Load data from HDFS for use in Spark applications.
2. Write the results back into HDFS using Spark.
3. Read and write files in a variety of file formats.
4. Perform standard extract, transform, load (ETL) processes on data using the Spark API
Data Analysis: Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
1. Use metastore tables as an input source or an output sink for Spark applications.
2. Understand the fundamentals of querying datasets in Spark
3. Filter data using Spark.
4. Write queries that calculate aggregate statistics
5. Join disparate datasets using Spark
6. Produce ranked or sorted data
Configuration: This is a practical exam and the candidate should be familiar with all aspects of generating a result, not just writing code.
1. Supply command-line options to change your application configuration, such as increasing available memory
Transform, Stage, and Store
The first learning objective for the CCA175 certification is to transform, stage and store data. The Hadoop ecosystem uses a distributed file system known as Hadoop’s Distributed File System or HDFS.
By the end of your learning you should be able to do the following:
- Convert a set of HDFS data values into a new data set or format.
- Load data from HDFS, and use it for Spark applications.
- Use Spark to write back the data into HDFS.
- Read and write files in different formats.
- Use the Spark API to extract, transform and load data.
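The objectives above can be sketched as a small read, transform and write job using the DataFrame API. This is a minimal sketch, not a model exam answer: `spark` is expected to be a `pyspark.sql.SparkSession`, and the column names, the `'COMPLETE'` status value and the paths are hypothetical.

```python
# Sketch of a read -> transform -> write job with the DataFrame API.
# `spark` is expected to be a pyspark.sql.SparkSession; the column names,
# the 'COMPLETE' status value and the paths are hypothetical.


def transform_and_store(spark, source_path, target_path):
    """Load CSV data from HDFS, keep completed orders, store as Parquet."""
    orders = spark.read.option("header", True).csv(source_path)

    completed = (
        orders
        .filter("order_status = 'COMPLETE'")   # SQL-style filter expression
        .select("order_id", "order_date")      # keep only the needed columns
    )

    # Write the result back into HDFS in a different file format.
    completed.write.mode("overwrite").parquet(target_path)
    return completed
```

In a pyspark shell, where a `spark` session already exists, this could be called as `transform_and_store(spark, "/user/cloudera/orders_csv", "/user/cloudera/orders_parquet")`, with both paths replaced by your own.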
The second learning skill that will be tested is data analysis. You will be expected to understand how to use Spark applications to do data analysis: filter data, calculate aggregate statistics, run queries, join datasets and produce ranked or sorted data.
You should, therefore, be able to use Spark SQL to interact with the metastore and generate reports using queries against loaded data. You should also be able to use metastore tables as an input source or an output sink for Spark applications, and to query datasets in Spark.
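To make this concrete, here is a minimal sketch of a report that filters, joins, aggregates and sorts with Spark SQL, then saves the result as a table. The table and column names are hypothetical, and `spark` is expected to be a `SparkSession` that can see the metastore.

```python
# Sketch: query metastore tables with Spark SQL and save the report as a table.
# Table and column names are hypothetical; `spark` is expected to be a
# SparkSession with access to the Hive metastore.

DAILY_REVENUE_QUERY = """
SELECT o.order_date,
       SUM(oi.order_item_subtotal) AS revenue
FROM orders o
JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_status = 'COMPLETE'
GROUP BY o.order_date
ORDER BY revenue DESC
"""


def daily_revenue_report(spark, output_table="daily_revenue"):
    """Run the report query and write the result back to the metastore."""
    report = spark.sql(DAILY_REVENUE_QUERY)
    report.write.mode("overwrite").saveAsTable(output_table)
    return report
```

Keeping the query in one string makes it easy to try the same statement in Hive or Impala first and then reuse it unchanged from Spark.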
Some of the sources of information as you prepare to take this exam will be YouTube tutorials and lessons including some of the videos below.
In this video, you’ll get familiar with the Sqoop tool:
You’ll also need to get familiar with Apache Kafka. For that, you can read our article on how to prepare for the Apache Kafka certification exam (CCDAK). There is also a nice video in that article explaining what Apache Kafka is and why you need it.
The third skill is configuration. As the exam is practical with scenario questions, you will be expected to be familiar with every step of solving the given problems, not just writing code, as the scores will be based on results and not the code.
You should also be able to change your application configuration by supplying command-line options, such as increasing the available memory.
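As an illustration, the sketch below assembles a `spark-submit` command line that sets memory and executor options. The application file name and the chosen values are hypothetical; the flags themselves are standard `spark-submit` options.

```python
import shlex


def spark_submit_command(app_path, driver_memory="2g", executor_memory="4g",
                         num_executors=4):
    """Build a spark-submit command line with explicit resource settings."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--num-executors", str(num_executors),
        "--conf", "spark.sql.shuffle.partitions=10",
        app_path,
    ]


# Hypothetical application file; executor memory raised from the default here.
command = spark_submit_command("revenue_report.py", executor_memory="8g")
print(shlex.join(command))
```

The same options can be passed to the `pyspark` and `spark-shell` launchers when you work interactively.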
The three skills above will be mandatory for you to pass the exam. In this article, we shall further break down the skills with a few more details, for you to prepare for.
Preparing for the CCA175 Certification Exam
Some people say that you can fully prepare for this exam in only one month. But in reality, it’s not possible. Not even close. You will waste your money if you sit for the exam unprepared. Make sure you go through as many resources as possible before booking your exam.
Read everything you can about Spark & Hadoop
Apache Spark is a distributed processing engine, providing APIs to facilitate distributed computing. It is not a programming language but is a cluster-computing framework. It is open-source, making it easily available and free for use.
You can get all the instructional material to learn about Spark and Hadoop in this Udemy course (CCA 175 – Spark and Hadoop Developer Certification – Scala).
If you will use Python to take this exam, then you should look at the CCA 175 – Spark and Hadoop Developer – Python (pyspark) course.
These courses are available on Udemy at an affordable cost and will teach you the full CCA175 Spark and Hadoop Developer curriculum. They will teach you about Apache Sqoop and how to execute HDFS commands. Also included in the course content is programming with Scala or Python, with all the fundamentals provided.
YouTube also offers some great resources that you can use for free. Check out the video below to learn about Spark.
Some of the other places you could look online to learn about Hadoop and Spark are from the following links:
- https://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html (An Introduction to Hadoop and Spark Storage Formats (or File Formats))
- https://www.oreilly.com/library/view/hadoop-application-architectures/9781491910313/ch01.html (Chapter 1. Data Modeling in Hadoop)
- https://www.youtube.com/watch?v=ziqx2hJY8Hg (Hadoop Tutorial: Intro to HDFS)
Have a good understanding of how to execute HDFS Commands
Understanding HDFS commands is a basic requirement for acing the CCA175 exam. By the time you take the exam, you should be able to comfortably execute basic and frequently used Hadoop HDFS commands to perform file operations.
This site here provides some of the best tutorials from where you can learn how to do these commands.
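For reference, a few of the frequently used `hdfs dfs` commands are sketched below as argument lists built in Python. The local file name and the HDFS directory are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def hdfs_dfs(*args):
    """Return an `hdfs dfs` command as an argument list."""
    return ["hdfs", "dfs", *args]


# Hypothetical local file and HDFS directory, used only for illustration.
LOCAL_FILE = "orders.csv"
HDFS_DIR = "/user/cloudera/practice"
HDFS_FILE = HDFS_DIR + "/" + LOCAL_FILE

commands = [
    hdfs_dfs("-mkdir", "-p", HDFS_DIR),         # create a directory
    hdfs_dfs("-put", LOCAL_FILE, HDFS_DIR),     # copy a local file into HDFS
    hdfs_dfs("-ls", HDFS_DIR),                  # list the directory
    hdfs_dfs("-cat", HDFS_FILE),                # print the file contents
    hdfs_dfs("-get", HDFS_FILE, "copy.csv"),    # copy from HDFS to local disk
    hdfs_dfs("-rm", "-r", HDFS_DIR),            # remove the directory recursively
]

for cmd in commands:
    print(shlex.join(cmd))
```

On a machine with the Hadoop client installed, each list could be passed to `subprocess.run(cmd, check=True)` to execute it, or the printed lines can simply be typed at the shell.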
Learn how to move data between relational databases & HDFS using Sqoop
Sqoop is used for data transfer between Hadoop and relational databases. You should be able to use the Sqoop tool to import and export data.
Check out the tutorial below for a detailed view of what Sqoop is and how to use it.
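To give an idea of what import and export look like, the sketch below builds one `sqoop import` and one `sqoop export` command. The JDBC URL, user name, password file, table names and directories are all hypothetical placeholders; the script only prints the commands.

```python
import shlex

# Hypothetical connection details; replace with those of your own database.
JDBC_URL = "jdbc:mysql://dbhost:3306/retail_db"
DB_USER = "retail_user"
PASSWORD_FILE = "/user/cloudera/.db_password"


def sqoop_import(table, target_dir, num_mappers=1):
    """Command that copies a relational table into an HDFS directory."""
    return [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--username", DB_USER,
        "--password-file", PASSWORD_FILE,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


def sqoop_export(table, export_dir):
    """Command that copies files from an HDFS directory into a table."""
    return [
        "sqoop", "export",
        "--connect", JDBC_URL,
        "--username", DB_USER,
        "--password-file", PASSWORD_FILE,
        "--table", table,
        "--export-dir", export_dir,
    ]


print(shlex.join(sqoop_import("orders", "/user/cloudera/orders")))
print(shlex.join(sqoop_export("daily_revenue", "/user/cloudera/daily_revenue")))
```

Note that for an export the target table must already exist in the database.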
Choose a programming language between Python and Scala
The CCA175 certification exam is available in either the Scala or the Python programming language. You should, therefore, be comfortable with the use of at least one of these.
Check out the following link if you wish to use the Scala programming language:
CCA 175 – Spark and Hadoop Developer Certification – Scala
If you are more comfortable with using the Python programming language, then you can take the following course on Udemy:
CCA 175 – Spark and Hadoop Developer – Python (pyspark)
Polish up on your SQL (HiveQL) skills
This certification exam requires that you have a good understanding of the SQL language. You should, therefore, polish up on your SQL skills, and learn how to structure, author and manage SQL databases and how to do data analysis with SQL. We recommend that you learn how to use the Hive Query Language (HiveQL).
Check out this link (Hive tutorials) that can help you to learn Hive SQL in 3 days.
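If you do not have a Hive environment at hand, you can still rehearse the core query patterns. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Hive; the tables and data are made up, and HiveQL differs from SQLite in some details, so treat it as practice for the query shape only.

```python
import sqlite3

# In-memory SQLite database standing in for Hive, just to practise SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 75.0), (12, 2, 40.0);
""")

# Join, aggregate, filter on the aggregate, and sort.
query = """
SELECT c.customer_name,
       COUNT(o.order_id) AS order_count,
       SUM(o.amount)     AS total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
HAVING SUM(o.amount) > 30
ORDER BY total_amount DESC
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The customer without orders does not appear in the output because an inner join keeps only matching rows.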
Develop Spark-based applications using core APIs
In practicing for the CCA exam, you should be able to develop Spark-based applications using core APIs.
Spark requires a data structure to hold data. The core API works with RDDs (resilient distributed datasets), and on top of these you can use either of two higher-level options: Dataset and DataFrame.
You should be able to perform transformation and action tasks for the dataset. Learning these tasks will be critical for you to be able to pass your exam.
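The difference between transformations and actions can be sketched with a classic word count on the RDD API. This is an illustrative sketch: `sc` is expected to be a pyspark `SparkContext` (available as `sc` in the pyspark shell), and the input path is whatever text file you choose.

```python
# Sketch of RDD transformations (lazy) followed by an action.
# `sc` is expected to be a pyspark SparkContext; the input path is up to you.


def split_words(line):
    """One input line -> list of words."""
    return line.split()


def to_pair(word):
    """Word -> (word, 1) so that counts can be summed per key."""
    return (word, 1)


def add(a, b):
    """Combine two partial counts for the same word."""
    return a + b


def word_counts(sc, path):
    lines = sc.textFile(path)      # defines the RDD lazily: nothing is read yet
    counts = (
        lines
        .flatMap(split_words)      # transformation
        .map(to_pair)              # transformation
        .reduceByKey(add)          # transformation
    )
    return counts.collect()        # action: triggers the actual computation
```

Transformations only describe the computation; Spark runs it when an action such as `collect`, `count` or `saveAsTextFile` is called.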
Check out the following sites for some free tutorials that you can use to learn how to develop Spark-based applications using core APIs.
Integrate Spark SQL & data frames to Spark-based applications
A great resource that you can use to learn how to integrate Spark SQL & data frames to Spark-based applications is the following free tutorial from towardsdatascience.com.
Here you will get lessons on how to leverage the power of relational databases, using Spark SQL and DataFrames. In greater detail, you will get to understand the challenges with scaling relational databases, understand Spark SQL and DataFrames, and get insights from an actual case study.
Learn how to stream data pipelines – Flume, Kafka and Spark Structured Streaming
Watch the video below to learn how you can:
- Develop end to end applications that read data from web server logs.
- Stream and connect into Kafka.
- Process data using Spark Streaming.
- Write data to HBase.
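The Kafka-to-Spark part of such a pipeline can be sketched with Spark Structured Streaming as below. This covers only one segment of the pipeline listed above: it prints messages to the console instead of writing to HBase, and the broker address and topic name are supplied by you. `spark` is expected to be a `SparkSession` started with the Spark Kafka connector package available.

```python
# Sketch: read a Kafka topic with Spark Structured Streaming.
# `spark` is expected to be a SparkSession started with the Kafka connector
# package available; broker address and topic name are supplied by the caller.


def stream_log_lines(spark, bootstrap_servers, topic):
    """Start a streaming query that prints Kafka message values."""
    messages = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )

    # Kafka delivers keys and values as bytes; cast the value to a string.
    lines = messages.selectExpr("CAST(value AS STRING) AS line")

    # The console sink is used here for simplicity instead of HBase.
    return (
        lines.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )
```

The function returns the running query object, so the caller can block on it with `awaitTermination()` or stop it with `stop()`.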
Again, if you want to make sure you know the Apache Kafka part well, check out our guide for learning Apache Kafka.
Take Practice Tests
As with all other skills, mastery requires practice.
Once you have gone through the notes and learned all the skills that you need for this course, you should embark on an intense practice regime. Remember that the CCA175 examination uses scenario questions, and you will be scored on the final results rather than the code that you use.
You need to be prepared to try as many practice questions as possible.
You should look at ITversity’s Udemy course here for some great practice questions, and lessons.
The scenario questions given in this course are similar to the ones that you will get during the CCA exam. The course has 5 practice tests: practice tests 1 and 2 have 9 questions, practice test 3 has 8 questions and practice test 4 has 6 questions. It also has 7 questions on Sqoop import and export.
To attempt these practice questions, you need to install the Cloudera VM and have gone through the other Udemy course for your programming language of choice. You should be proficient with Sqoop, Hive, and Spark by this point.
The course also has sample solutions for the scenario questions. You should avoid the temptation to look at the suggested solutions before you try them.
You should attempt these and many other practice questions under the same environment and conditions as the main exam. This means you do not refer to written material, and you keep to the allotted time when answering the questions.
All in all, the better you are at answering practice questions, the more prepared you will be to tackle the exam.
Registering for the CCA175 Certification Exam
We highly recommend that you do not register for the exam until you are thoroughly prepared for it. You should take these suggested courses, and use the resources to learn as much as possible about everything you need to know to get the certification.
Once you are confident that you have mastered the content of the course, then you can proceed and register for the examination.
How to register for the CCA175 Certification Exam:
To register for this exam:
- Log on to Cloudera using this link and click on the purchase link for CCA Spark and Hadoop Developer (CCA175). If you have never registered before at Cloudera, you will need to create a Cloudera Single Sign On (SSO) account before proceeding to register for the exam.
- Review all the details carefully and click on purchase. The certification costs $295, so you will need to have funded your account sufficiently to be able to process the payment.
- When registration is complete, you will receive an email with instructions about how to create an account with examslocal.com, to enable scheduling your exam.
- Follow the instructions, and you will be booked for your examination slot. If you can’t see your preferred time slot, check whether other time slots will be suitable for you.
Due to limited time slots, it will be best for you to register as early as possible, as time slots are given on a first-come, first-served basis.
You can also reschedule your exam, should you wish to. Again you will log onto examslocal.com and click on “my exams” where you will be guided appropriately on how to reschedule.
What else do you need to know
Take note of the following additional pointers when you have registered for the CCA175 Certification exam.
- To take the CCA175 examination, you will need to have a good computer with a stable internet connection. Your computer will require a built-in camera or a webcam and a microphone. You will also need Google Chrome with screen-sharing plugins installed.
- Install the Cloudera Quickstart VM in advance, and get accustomed to the tools and features.
- Familiarize yourself with the exam environment by watching the video at https://www.cloudera.com/about/training/certification/cca-spark.html
- You will be monitored for the duration of the exam by a proctor. You will, therefore, have no access to additional resources from internet sites, saved notes or friends.
- You are also not allowed to take breaks during the exam, so you should ensure that you have eaten lightly and are comfortable.
- Choose a comfortable, silent and well-lit room. As you choose your time slot, consider that you will need a peaceful atmosphere.
- Consider having backup power and a backup internet connection for the duration of your exam, as power or internet mishaps that occur on your end will not be excused.
If you are passionate about Big Data and are looking for a certification that will give you an edge over other candidates, then you should take the CCA175 Certification exam by Cloudera. You should know, however, that getting certified is not easy. It takes effort, it takes grit.
Thousands of IT professionals take the step towards CCA175 certification every year, but most of them fall short. You should, therefore, prepare adequately. The resources that we have shared in this article will get you a step closer to achieving that dream.
The Udemy courses that we have shared here are an affordable and sure way to prep yourself ahead of this exam.
If you prepare right, you will have no problem, and will shortly be getting your CCA175 certification.