Learning and Operating Presto e-Book:
Learning and Operating Presto is an ebook, available in PDF, published by O’Reilly Media and written by Vivek Bharathan, David E. Simmen, and George Wang. It was released in January 2022 (Early Access).
The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this distributed SQL query engine can be challenging even for the most experienced engineers. This practical book shows you how to begin Presto operations at your organization to derive insights on datasets wherever they reside.
Authors Vivek Bharathan, David Simmen, and George Wang explain what Presto is, where it came from, and how it differs from other data warehousing solutions. You'll discover why Facebook, Uber, Twitter, and cloud providers including AWS, Google Cloud, and Alibaba use Presto and how you can quickly deploy Presto in production.
You'll learn about:
- Presto security and administration
- Syntax and connectors
- Top 15 key configuration parameters
- Clusters and tuning
- Troubleshooting: logs, error messages, and more
- Extending Presto for real-time business insight
- Extending PrestoDB
Back in 2014, I was part of the Big Data Platform team at Netflix, and we faced the challenge of building a large-scale interactive data analytics platform for the whole company. Netflix was an early pioneer in bringing a data-driven culture to video entertainment; the various product teams used a data-driven approach to making product decisions. My team therefore played a significant part in helping those teams make sense of their multiple petabytes of data to gain product and consumer insights, including recommendations, user experiences, and audience segmentation.
As an early user of public cloud providers, namely Amazon Web Services (AWS), we had moved our big data platform to an architecture which separated the compute and storage layers. This allowed for a separation of concerns: running compute clusters for the analytic jobs and storing data at the lowest cost per terabyte.
At that time, most of our users were either running Pig jobs, which could take hours, or running interactive jobs on a much smaller subset of data in third-party products. Their desire for better data-driven insights created the need for low-latency interactive data exploration on our much broader set of data on Amazon S3.
We evaluated multiple systems at that time against our production workloads in terms of performance and reliability, and Presto, which was initially released to open source around the end of 2013, was the clear winner. To productionize Presto, we worked on improving its AWS and Parquet file format integration, and we helped make Presto much more reliable, scalable, and elastic so that it could become a first-class citizen in Netflix’s data infrastructure. That epitomizes what is great about open source: when you have an itch, you can scratch it! Since Presto supports ANSI SQL, it was very easy for our analysts and developers to use. The number of jobs grew quickly, and before we knew it Presto had become an integral part of the Netflix data ecosystem. We were one of the first major users outside of Facebook to leverage Presto at scale.
Why This Book Is Important
Since those early days, companies of all sizes and types have sought ways to become more data driven in their own business. The days of “build it, sell it, and ship it” no longer apply in today’s competitive environment. “Innovate or die” has become a mantra within most companies, causing them to explore larger amounts of data at shorter intervals than ever before. The companies that do this well are considered part of a new Data Economy; that is, they derive significant additional value from their data.
Starting with Facebook, the Presto community has expanded at an astonishing pace. Beyond Netflix, other internet-scale companies like Uber, Twitter, LinkedIn, and JD.com leverage Presto for a wide variety of use cases. In addition, major cloud providers and vendors also have Presto offerings.
Presto is one of the fastest growing open source projects in data analytics today because it fits well with that data-driven paradigm shift. I believe that’s due to three primary reasons: 1) Presto is based on ANSI SQL so it’s easy for people to get running with it, 2) the Presto connector architecture enables the federated access of almost any data source, whether a database, data lake, or other data system, and 3) Presto can start from one node and scale to thousands. As such, there is now an energized community around Presto, having started with the Facebook Presto team, and having grown to hundreds of contributors around the world.
While Presto adoption has been nothing but amazing, there are still many more people who want to get started with Presto. Presto is a complex distributed system that runs on many machines to process diverse workloads from multiple users, and in a typical deployment it interacts with multiple storage systems. Adding to this complexity, Presto has plenty of configuration knobs that change its behavior, and getting the best performance sometimes requires careful tuning of the queries or of the system itself. Therefore, it can be challenging even for experienced engineers to ramp up on the Presto architecture and its operations.
That’s where this book comes in. Learning and Operating Presto is an approachable on-ramp to using Presto and getting Presto into a production deployment. It is intended to make Presto more accessible to anyone who cares about their organization becoming data driven. Regardless of whether you’re a current user of Presto, new to Presto, or going to provide Presto to your users, this book does an excellent job of explaining what you need to know.
I encourage you to supplement what you learn in this book by participating in Presto’s growing community, which is filled with experienced users, developers, and data architects. Join the PrestoDB Slack channel and attend the meetups and events. Use this book as your entry point into the world of Presto.
Prospective Table of Contents
- Chapter 1: Introducing Presto
- Chapter 2: Getting Started
- Chapter 3: Security
- Chapter 4: Administration
- Chapter 5: Syntax
- Chapter 6: Connectors
- Chapter 7: Top 15 Key Configuration Parameters
- Chapter 8: Clusters
- Chapter 9: Tuning
- Chapter 10: Operating Presto at Scale
- Chapter 11: Troubleshooting: Logs, Error Messages and More
- Chapter 12: Real-time Analytics for Real-time Business Insights: Presto + Apache Pinot
- Chapter 13: Extending Presto
About the Publisher
O’Reilly’s mission is to change the world by sharing the knowledge of innovators. For over 40 years, we’ve inspired companies and individuals to do new things—and do things better—by providing them with the skills and understanding that are necessary for success.