FlinkSketch: Democratizing the Benefits of Sketches for the Flink Community

Lightning Talk

Enterprises ingest massive volumes of streaming data into Flink to derive real-time insights. For instance, financial institutions process credit card transactions to monitor risk and detect fraud, while observability platforms ingest telemetry data to monitor application performance. While traditional Flink analytics pipelines have served us well so far, the rising scale and complexity of data are driving an untenable increase in cloud costs, along with latency that prohibits real-time decision-making. Thus, there is a need to rethink the design of aggregate analytics pipelines.

Sketching algorithms provide an effective alternative to traditional aggregation by leveraging compact, probabilistic data structures to provide highly accurate and low-cost analytics. These algorithms estimate aggregates such as distinct counts, frequencies, and quantiles, and are amenable to massively parallel processing because sketches built on separate partitions can be merged. Sketches are backed by extensive research and estimate aggregates with mathematically bounded errors. Unfortunately, implementations of these algorithms have not made it into the Flink ecosystem, preventing the Flink community from reaping their benefits.
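To make the idea concrete, the following is a minimal, self-contained illustration of a mergeable distinct-count sketch, written in plain Python. It uses the classic K-Minimum-Values (KMV) technique as a stand-in; it is not the Apache DataSketches API and not the FlinkSketch implementation. The class name `KMVSketch`, the parameter `k`, and the use of SHA-256 as the hash function are all assumptions made for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values distinct-count sketch (hypothetical class)."""

    def __init__(self, k=1024):
        self.k = k            # number of smallest hash values to retain
        self._heap = []       # max-heap (stored as negatives) of retained hashes
        self._members = set() # same hashes, for O(1) duplicate checks

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest

    def merge(self, other):
        # The union's k smallest hashes are among each side's retained hashes,
        # so sketches from parallel partitions combine without the raw data.
        merged = KMVSketch(min(self.k, other.k))
        smallest = heapq.nsmallest(merged.k, self._members | other._members)
        merged._members = set(smallest)
        merged._heap = [-h for h in smallest]
        heapq.heapify(merged._heap)
        return merged


if __name__ == "__main__":
    # Two overlapping "partitions" of hypothetical user IDs.
    left, right = KMVSketch(), KMVSketch()
    for user_id in range(0, 60_000):
        left.update(user_id)
    for user_id in range(40_000, 100_000):
        right.update(user_id)
    print("estimated distinct users:", round(left.merge(right).estimate()))
```

Each sketch retains at most `k` hash values regardless of how many events it sees, and the relative error of the estimate shrinks roughly as one over the square root of `k`. The `merge` step is what makes the structure a natural fit for parallel operators: each parallel instance can maintain its own sketch and a downstream step can combine them.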

We provide a library of sketches for Flink by integrating Apache DataSketches, an open-source library of sketches, into the Flink ecosystem. Users can access our library through the Flink DataStream API or through a declarative YAML configuration that specifies which sketches to use, their parameters, which labels to key by, and so on. We are also integrating newer sketches like UnivMon, Hydra, and DDSketch, which provide novel capabilities. We are in the process of open-sourcing our implementation and initial benchmark results, and hope that the community can benefit from this effort.

Songting Wang

Carnegie Mellon University

Milind Srivastava

Carnegie Mellon University