Scaling the Past: Productionalizing the Flink History Server for Stream and Batch

Lightning Talk

Flink powers both streaming and batch data workflows. While Flink’s Web UI is useful for real-time monitoring, it falls short when streaming jobs terminate unexpectedly: once a job is gone, convenient access to its logs, metrics, and exceptions goes with it. The gap is even more critical for batch jobs, which terminate as a matter of course. This is where the Flink History Server comes in.

In its current state, the Flink History Server has limitations that make it impractical for production use. Most notably, the size of the local cache is the primary limit on how many jobs it can store, and the built-in log navigation is rudimentary. In this presentation, I'll share how these issues were addressed by a) improving scalability to support hundreds of jobs for both streaming and batch Flink workflows, b) introducing pluggable storage backends, and c) enabling pluggable log linking handlers. Together, these enhancements significantly improve the Flink History Server's ability to support production workloads.

Join us to learn how we’re building a more robust and versatile future for Flink's past. More details can be found in FLIP-505.

https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch

Allison Chang
