Author: Sev Leonard
Publisher: O’Reilly Media, Inc.
Year: 2023
Pages: 286
Language: English
Format: epub (true), mobi
Size: 10.2 MB
The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?
With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.
When working with Spark, the Spark UI provides additional diagnostic information: executor load, how evenly (or unevenly) your computation is balanced across executors, shuffles, spill, and the query plans that show how Spark is running your query. This information can help you tune Spark settings, data partitioning, and data transformation code.
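As a rough illustration of what "well balanced across executors" can mean in practice, here is a minimal Python sketch that summarizes executor imbalance as a single number. It does not call Spark or the Spark UI; the function name `skew_ratio`, the executor labels, and the timings are all hypothetical, standing in for task durations you might copy by hand from the UI's Executors tab. The ratio of the busiest executor to the median is one simple heuristic among several, not a method taken from the book.

```python
from statistics import median


def skew_ratio(task_seconds_by_executor):
    """Return the busiest executor's total task time divided by the median total.

    A value near 1.0 suggests evenly spread work; larger values suggest
    that one executor is carrying a disproportionate share.
    """
    totals = [sum(times) for times in task_seconds_by_executor.values()]
    if not totals:
        raise ValueError("no executors to compare")
    mid = median(totals)
    if mid == 0:
        # All-zero input counts as balanced; otherwise the ratio is unbounded.
        return 1.0 if max(totals) == 0 else float("inf")
    return max(totals) / mid


# Hypothetical task durations in seconds, keyed by a made-up executor label.
sample = {
    "exec-1": [12.0, 11.5, 12.5],
    "exec-2": [11.0, 12.0, 13.0],
    "exec-3": [40.0, 38.0, 42.0],
}
ratio = skew_ratio(sample)
print(f"skew ratio: {ratio:.2f}")
```

In this made-up sample the third executor does far more work than the other two, which is the kind of imbalance that might prompt a look at data partitioning.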
By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:
Reduce cloud spend with lower-cost cloud service offerings and smart design strategies
Minimize waste without sacrificing performance by rightsizing compute resources
Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
Set up development and test environments that minimize cloud service dependencies
Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
Improve data quality and pipeline operation through validation and testing
Who This Book Is For:
I’ve geared the content toward an intermediate to advanced audience. I assume you have some familiarity with software development best practices, some basics about working with cloud compute and storage, and a general idea about how batch and streaming data pipelines operate. This book is written from my experience in the day-to-day development of data pipelines. If this is work you either do already or aspire to do in the future, you can consider this book a virtual mentor, advising you of common pitfalls and providing guidance honed from working on a variety of data pipeline projects.
If you’re coming from a data analysis background, you’ll find advice on software best practices to help you build testable, extendable pipelines. This will aid you in connecting analysis with data acquisition and storage to create end-to-end systems. Developer velocity and cost-conscious design are areas everyone from individual contributors to managers should have on their mind. In this book, you’ll find advice on how to build quality into the development process, make efficient use of cloud resources, and reduce costs. Additionally, you’ll see the elements that go into monitoring to not only keep tabs on system health and performance but also gain insight into where redesign should be considered. If you manage data engineering teams, you’ll find helpful tips on effective development practices, areas where costs can escalate, and an overall approach to putting the right practices in place to help your team succeed.
Download Cost-Effective Data Pipelines: Balancing Trade-Offs When Developing Pipelines in the Cloud (Final Release)