Author: Ernesto Lee
Publisher: Consultants Network
Year: 2020
Pages: 118
Language: English
Format: pdf, epub
Size: 12.1 MB
This book provides an introduction to recommender systems using Apache Spark and Machine Learning. Before we begin, we define Big Data and Machine Learning. We then dive directly into our use case, showing you how to build a recommender system with Apache Spark and Machine Learning, step by step.
Apache Spark is an open-source, fast, unified engine for parallel, large-scale data processing. It provides a framework for distributed processing of large datasets at high speed. Spark supports most popular programming languages, including Java, Python, Scala, and R. Spark is scalable, meaning it can run on anything from a single desktop or laptop to a cluster of thousands of servers. Spark ships with a set of built-in libraries for data analysis over large datasets. However, if your requirements exceed the capabilities of the built-in libraries, you can write your own or explore the countless external libraries from open source communities on the internet.
Why use Spark when we have Hadoop? Spark excels as a unified platform for processing huge volumes of data at very high speed across a variety of data processing requirements. Spark is an in-memory processing framework and is often described as the successor to Apache Hadoop. Let us briefly discuss the advantages of Spark over Hadoop.
With the Hadoop ecosystem, we had a separate framework for each data processing requirement. As a developer, you could use the MapReduce framework to analyze your data in a programming language of your choice, such as Java, C++, or Python. However, a data warehouse engineer who is a SQL expert had to learn one of these programming languages to leverage the MapReduce framework. To overcome this problem, a new framework called “Hive” was introduced on top of Hadoop. There was a similar problem for ETL processing, so “Pig” was introduced. Similarly, “Giraph” and “Mahout” were introduced for graph processing and Machine Learning, respectively. Moreover, Hadoop was only used for batch processing, with no way to process data in real time, so a new framework called “Storm” was integrated with Hadoop to work with streaming data. With so many frameworks, organizations had a tough time maintaining them all and tracking the issues pertaining to each.

Fortunately, all this changed with the advent of Spark. As mentioned, Spark is a unifying platform that provides all of these capabilities as one package with four major components. Now, what does in-memory processing actually mean? Aren’t all applications processed in memory anyway? Yes, all applications process data in memory and write results back to disk when processing is done, but Spark can additionally retain intermediate data in memory across computations, or write it to disk, as you choose.
Download Hands-On Machine Learning Recommender Systems with Apache Spark