VLDB2021
The evolution of Amazon Redshift
Ippokratis Pandis
被引用 22 次
摘要
In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift [7] , the first fully managed, petabyte-scale enterprise-grade cloud data warehouse. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This launch was a significant leap from the traditional on-premise data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Customers embraced Amazon Redshift and it became the fastest growing service in AWS. Today, tens of thousands of customers use Amazon Redshift in AWS's global infrastructure of 25 launched Regions and 81 Availability Zones (AZs), to process exabytes of data daily. The success of Amazon Redshift inspired a lot of innovation in the analytics segment, e.g. [1, 2, 4, 10] , which in turn has benefited customers. In the last few years, the use cases for Amazon Redshift have evolved and in response, Amazon Redshift continues to deliver a series of innovations that delight customers. In this paper, we give an overview of Amazon Redshift's system architecture. Amazon Redshift is a columnar MPP data warehouse [7] . As shown in Figure 1 , an Amazon Redshift compute cluster consists of a coordinator node, called the leader node, and multiple compute nodes. Data is stored on Redshift Managed Storage, backed by Amazon S3, and cached in compute nodes on locally-attached SSDs in compressed columnar fashion. Tables are either replicated on every compute node or partitioned into multiple buckets that are distributed among all compute nodes. AQUA is a query acceleration layer that leverages FPGAs to improve performance. CaaS is a caching microservice of optimized generated code for the various query fragments executed in the Amazon Redshift fleet. The innovation at Amazon Redshift continues at accelerated pace. Its development is centered around four streams. First, Amazon Redshift strives to provide industry-leading data warehousing performance. Amazon Redshift's query execution blends database operators in each query fragment via code generation. It combines prefetching and vectorized execution with code generation to achieve maximum efficiency. This allows Amazon Redshift to scale linearly when processing from a few terabytes to petabytes