Apache Spark for Data Engineering and Machine Learning
About this Course
Apache® Spark™ is a fast, flexible, and developer-friendly open-source platform for large-scale SQL, batch processing, stream processing, and machine learning. Users can take advantage of its open-source ecosystem, speed, ease of use, and analytic capabilities to work with Big Data in new ways. In this short course, you'll explore concepts and gain hands-on skills to use Spark for data engineering and machine learning applications. You'll learn about Spark Structured Streaming, including data sources, output modes, and operations. Then, explore how graph theory works and discover how GraphFrames works with Spark DataFrames and supports popular graph algorithms. Organizations can acquire data from structured and unstructured sources and deliver the data to users in formats they can use. Learn how to use Spark to extract, transform, and load (ETL) data. Then, you'll hone your newly acquired skills during your "ETL for Machine Learning Pipelines" lab. Next, discover why machine learning practitioners prefer Spark. You'll learn how to create pipelines and quickly implement feature extraction, selection, and transformation on structured data sets. Discover how to perform classification and regression using Spark. You'll be able to define and identify both supervised and unsupervised learning. Learn about clustering and how to apply the k-means clustering algorithm using Spark MLlib. You'll reinforce your knowledge with focused, hands-on labs and a final project where you will apply Spark to a real-world inspired problem. Prior to taking this course, please ensure you have foundational Spark knowledge and skills, for example, by first completing the IBM course titled "Big Data, Hadoop and Spark Basics."
Created by: IBM
Level: Intermediate