Author: Mahmoud Parsian
Publisher: O’Reilly Media, Inc.
Year: 2021-09-10
Pages: 390
Language: English
Format: epub
Size: 10.1 MB
Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.
Why should we use Spark? Spark is a powerful analytics engine for large-scale data processing. The most important reasons for using Spark are:
• Spark is simple, powerful, and fast: it keeps working data in RAM rather than on disk, which can run some workloads up to 100x faster than disk-based MapReduce.
• Spark is open source, free, and applicable to a wide range of big data problems.
• Spark runs everywhere (Hadoop, Mesos, Kubernetes, standalone, or in the cloud).
• Spark can read/write data from/to many data sources.
• Spark can read/write data in row-based and column-based formats (such as Parquet and ORC), as shown in the sketch after this list.
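As a minimal sketch of the last two bullets, here is how reading and writing different formats looks in PySpark (the file names people.csv and people.parquet are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("formats-demo").getOrCreate()

    # Read a row-oriented CSV file (the path and columns are hypothetical).
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Write the same data back out in a column-oriented format.
    df.write.mode("overwrite").parquet("people.parquet")

    # Parquet stores the schema, so reading it back needs no inference.
    df2 = spark.read.parquet("people.parquet")
    df2.show()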
In a nutshell, Spark unlocks the power of data by handling big data with powerful APIs, ease of use, and speed. It is one of the best choices for large-scale data processing and for solving MapReduce problems and beyond. Solving big data problems with MapReduce/Hadoop is complex: you have to write tons of low-level code to solve even primitive problems, and this is where the power and simplicity of Spark come in. Apache Spark is much faster than Apache Hadoop because it uses in-memory caching and optimized execution, and it supports general batch processing, streaming analytics, machine learning, graph algorithms, and SQL queries.
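To make the in-memory caching point concrete, here is a minimal PySpark sketch (the dataset is made up); the first action materializes the RDD in memory, and later actions reuse it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()

    # A small RDD standing in for a large dataset.
    numbers = spark.sparkContext.parallelize(range(1, 1_000_001))

    # cache() keeps the computed partitions in memory, so the second
    # action reuses them instead of recomputing the whole lineage.
    squares = numbers.map(lambda n: n * n).cache()

    print(squares.count())   # first action: computes and caches
    print(squares.take(5))   # second action: served from memory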
Spark’s “native” language is Scala, but you can use language APIs to run Spark code from other programming languages (for example, Java, R, and Python). In this book, I teach you how to solve your big data problems in Spark by expressing your solutions in PySpark. You will learn how to read your data and represent it as an RDD or a DataFrame. The RDD is Spark’s fundamental data abstraction; the DataFrame (a distributed table of rows with named columns) lets developers impose a structure onto a distributed collection of data, allowing a higher-level abstraction. Once your data is represented as an RDD or a DataFrame, you can apply transformation functions (such as mappers, filters, and reducers) to transform it into your desired form. The book presents many Spark transformations that you can use for ETL, analysis, and data-intensive computations.
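As a small taste of these abstractions, here is a minimal PySpark sketch (the names and values are invented) that applies the same mapper/filter/reducer idea first to an RDD and then to a DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    # RDD: Spark's low-level distributed collection, transformed with
    # mappers, filters, and reducers.
    rdd = spark.sparkContext.parallelize([("alice", 3), ("bob", 5), ("alice", 7)])
    totals = (rdd
              .filter(lambda kv: kv[1] > 2)           # keep values above 2
              .map(lambda kv: (kv[0], kv[1] * 10))    # scale each value
              .reduceByKey(lambda a, b: a + b))       # sum per key
    print(totals.collect())  # e.g. [('bob', 50), ('alice', 100)]; order may vary

    # DataFrame: the same data as a distributed table with named columns.
    df = spark.createDataFrame(rdd, ["name", "value"])
    (df.filter(col("value") > 2)
       .groupBy("name")
       .sum("value")
       .show())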
Download Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark (Fourth Early Release)