Current New Orleans 2025
Session Archive
Check out our session archive to catch up on anything you missed or rewatch your favorites to make sure you hear all of the industry-changing insights from the best minds in data streaming.
Metadata: From Zookeeper to KRaft
As a distributed system, Kafka needs more than streaming data to function; it also needs information about the distributed system itself, known as metadata. Which node is the controller? What is the topic configuration? Which principals appear in the access control lists (ACLs)? Metadata answers all of these questions. Metadata management in Kafka has evolved throughout its history and recently underwent its most drastic change: a migration from ZooKeeper to KRaft. In this session, we will discuss what metadata is and why it is important in an event streaming system. We will also explore the differences between ZooKeeper and KRaft, learn why a migration is necessary, and understand the general operations to follow when migrating with Confluent Platform.

ZooKeeper has been paramount to the Kafka ecosystem since its inception. As a general-purpose system for managing distributed metadata, ZooKeeper requires separate nodes within a cluster. For this reason, and others we will discuss, KRaft was developed from the ground up as a quorum-based replacement. Based on the Raft consensus algorithm, KRaft allows each Kafka broker to also act as a metadata manager, reducing complexity and infrastructure management requirements.

Migrating from ZooKeeper to KRaft should be done carefully. Metadata is often expected to be persisted permanently; for example, ACLs for authorization are managed as metadata. A loss of metadata during migration may result in disruptive data loss; however, Confluent has released tools and processes to make the migration safer. We will demonstrate such a process, known as dual_write mode, as we walk through a safety-first migration using Confluent Platform and Ansible. Through an increased knowledge of metadata management in Kafka, this session seeks to grant you greater peace of mind during your next migration from ZooKeeper to KRaft.
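For readers who want a concrete anchor, here is a minimal, illustrative sketch of the kind of broker-side settings involved when a ZooKeeper-backed broker enters the dual-write migration phase; the exact property names, values, and sequencing depend on your Kafka and Confluent Platform versions, so treat this as an assumption-laden outline rather than the procedure the session demonstrates.

```java
import java.util.Properties;

// Illustrative only: broker settings typically touched during a ZooKeeper-to-KRaft
// migration in dual-write mode (KIP-866). Hostnames, ports, and node IDs are
// placeholders; always follow the official migration guide for your version.
public class MigrationBrokerConfig {

    public static Properties brokerMigrationProps() {
        Properties props = new Properties();
        // The existing ZooKeeper connection stays in place while dual-write is active.
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
        // Enable the ZooKeeper-to-KRaft migration path on the broker.
        props.put("zookeeper.metadata.migration.enable", "true");
        // Point the broker at the new KRaft controller quorum.
        props.put("controller.quorum.voters", "3000@controller1:9093");
        props.put("controller.listener.names", "CONTROLLER");
        return props;
    }
}
```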
Phillip Groves
How Reddit Uses Flink Stream Joins in Its Real-Time Safety Systems
Acting on policy-violating content as quickly as possible is a top priority of Reddit’s Safety team and is accomplished through technologies such as Rule-Executor-V2 (REV2), a real-time rules engine that processes streams of events flowing through Reddit. While a low time-to-process latency, measured as the time it takes for some activity on the site to flow through REV2, is an important metric to optimize for, it is equally important for REV2 to be able to identify more sophisticated policy-violating content. At Reddit, we use advanced machine learning (ML) signals to detect and act on such content. In this talk, we will discuss Signals-Joiner, a Flink-based system that enables REV2 to leverage slower-to-compute ML signals in its real-time context via stream joins. Specifically, we'll walk through the motivation behind the system, the evolution of our architecture (building a custom windowing strategy), key learnings, and the results we've achieved.
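As a rough illustration of the stream-join idea (not Reddit's actual Signals-Joiner code), the sketch below uses a Flink interval join to attach a slower-to-compute ML signal to the event it describes, assuming the Flink 1.x DataStream API with event-time watermarks assigned upstream; the tuple layouts are hypothetical.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// Generic interval-join sketch, assuming the Flink 1.x DataStream API with
// event-time watermarks assigned upstream. Events are (contentId, payload)
// tuples; ML signals are (contentId, score) tuples.
public class SignalJoinSketch {

    public static DataStream<Tuple3<String, String, Double>> join(
            DataStream<Tuple2<String, String>> events,
            DataStream<Tuple2<String, Double>> signals) {

        return events
                .keyBy(e -> e.f0)
                .intervalJoin(signals.keyBy(s -> s.f0))
                // Tolerate ML signals that arrive up to 10 minutes after the event.
                .between(Time.minutes(0), Time.minutes(10))
                .process(new ProcessJoinFunction<Tuple2<String, String>,
                                                 Tuple2<String, Double>,
                                                 Tuple3<String, String, Double>>() {
                    @Override
                    public void processElement(Tuple2<String, String> event,
                                               Tuple2<String, Double> signal,
                                               Context ctx,
                                               Collector<Tuple3<String, String, Double>> out) {
                        // Emit the original event enriched with the slower ML score.
                        out.collect(Tuple3.of(event.f0, event.f1, signal.f1));
                    }
                });
    }
}
```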
Vignesh Raja
Keynote: Building Intelligent Systems on Real-time Data
Join Jay Kreps, Confluent leadership, our customers, and industry thought leaders to learn how you can build intelligent systems with real-time data. We’ll show you why streaming is becoming ubiquitous across the business—and how that unlocks a shift-left approach: process and govern at the source, then reuse everywhere. Expect live demos and candid customer stories that make it concrete. Whether you’re a data leader, architect, or builder, you’ll leave with practical playbooks for bringing real-time AI to production. The future is here. Let's ignite it together!
Jay Kreps / Shaun Clowes / Sean Falconer / Rajesh Kandasamy / Cosmo Wolfe / Rachel Lo / Gunther Hagleitner
The Zen of PyFlink: The evolution towards a truly “pythonic” Flink
Are you a Python data streaming engineer who has invested hundreds of hours tracking down errors and going back to the documentation? Does your stream processing pipeline suffer slow iteration, or get abandoned, because frustrating results don't meet expectations? Then you are in the right place: we'll talk about a new idea for the PyFlink Table API that aligns it with the APIs Python developers love, like Pandas, Polars, and PySpark. There have been gaps in the PyFlink API and some mismatches that made applying PyFlink stream transformations feel a bit off. In this session, learn about the changes being proposed to improve the open source PyFlink Table API and make it the stream processing framework of choice for the data streaming community! This talk details the findings, improvements, and plans being made in the PyFlink open source community, with support from Confluent, Alibaba, OpenAI, and others, to move beyond the hard edges and idiosyncrasies of Apache Flink’s established JVM ecosystem. Join us to learn how to contribute and make PyFlink a truly loveable product.
Zander Matheson
FlinkSketch: Democratizing the Benefits of Sketches for the Flink Community
Enterprises ingest and analyze massive volumes of streaming data in Flink to derive real-time insights. For instance, financial institutions process credit card transactions to monitor risk and detect fraud, while observability platforms ingest telemetry data to monitor application performance. While traditional Flink analytics pipelines have served us well so far, the rising scale and complexity of data are causing an untenable increase in cloud costs, as well as increased latency that prohibits real-time decision-making. Thus, there is a need to rethink the design of aggregate analytics pipelines. Sketching algorithms provide an effective alternative to traditional aggregation by leveraging compact, probabilistic data structures to provide highly accurate and low-cost analytics. These algorithms are designed to estimate various aggregates like distinct counts, frequencies, and quantiles, and are amenable to massively parallel processing. Sketches are backed by extensive research and estimate aggregates with mathematically bounded errors. Unfortunately, implementations of these algorithms have not made it into the Flink ecosystem, preventing the Flink community from reaping their benefits. We have built a library of sketches for Flink by integrating Apache DataSketches, an open-source library of sketches, into the Flink ecosystem. Users can use our library with the Flink DataStream API or through a declarative YAML configuration where they specify the sketches to use, their parameters, which labels to key by, and so on. We are integrating newer sketches like UnivMon, Hydra, and DDSketch, which provide novel capabilities. We are in the process of open-sourcing our implementation and initial benchmark results, and hope that the community can benefit from this effort.
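To make the idea concrete, here is a hand-rolled example (not the FlinkSketch library itself) of plugging an Apache DataSketches HLL sketch into a Flink AggregateFunction for approximate distinct counts; the lgK parameter and types are illustrative assumptions.

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;
import org.apache.flink.api.common.functions.AggregateFunction;

// Illustrative example: approximate distinct-count aggregation using an
// Apache DataSketches HLL sketch as a Flink accumulator.
public class ApproxDistinctUsers implements AggregateFunction<String, HllSketch, Double> {

    private static final int LG_K = 12; // ~2^12 buckets; trades accuracy for size

    @Override
    public HllSketch createAccumulator() {
        return new HllSketch(LG_K);
    }

    @Override
    public HllSketch add(String userId, HllSketch sketch) {
        sketch.update(userId);
        return sketch;
    }

    @Override
    public Double getResult(HllSketch sketch) {
        return sketch.getEstimate(); // approximate distinct count with bounded error
    }

    @Override
    public HllSketch merge(HllSketch a, HllSketch b) {
        Union union = new Union(LG_K);
        union.update(a);
        union.update(b);
        return union.getResult();
    }
}
```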
Songting Wang / Milind Srivastava
Unlocking Inter-Agent Collaboration: Confluent Powers Scalable AI with Google Cloud's A2A
The next frontier in AI is intelligent agentic systems, where agents collaborate to achieve complex goals. Google Cloud's Agent2Agent (A2A) protocol offers a crucial open standard for this inter-agent communication, enabling discovery, coordination, and task execution. However, scaling these multi-agent systems in real-world enterprise environments demands a robust, asynchronous, and resilient communication backbone. This is precisely where Confluent, powered by Apache Kafka and Apache Flink, becomes indispensable. This session will explore the powerful synergy between Confluent Cloud and Google Cloud's A2A protocol. We'll delve into architectural patterns leveraging Kafka topics as the shared, real-time central nervous system for A2A message exchange, ensuring unparalleled scalability and decoupling. Attendees will learn how Confluent's fully managed services, including Apache Flink and comprehensive connectors, facilitate seamless data flow, real-time processing, and contextual enrichment of agent communications, enabling consistency and integrity. This makes agent interactions inherently trackable, shareable, and composable across your systems. Discover through practical use cases how Confluent's platform capabilities empower AI agents for intelligent automation and dynamic orchestration within the Google Cloud environment. This session will demonstrate why Confluent is the foundational platform for building truly scalable, integrated, and reliable AI ecosystems with Google Cloud Agent2Agent.
Dustin Shammo / Merlin Yamssi / Pascal Vantrepote
Scaling Agentic AI Delivery: How Infosys Leverages the Confluent OEM Program
Join Infosys, a global leader in technology services, for a 1:1 discussion on the state of data streaming. Hear how evolving customer demands and the rise of agentic AI are shaping their technology strategy, and why Infosys chose to partner with Confluent over open source Kafka to accelerate innovation and deliver differentiated solutions. During the session, you’ll learn about Confluent’s OEM Program—enabling product teams to embed a complete, enterprise-grade data streaming platform directly into their offerings, reducing engineering overhead while opening new revenue opportunities. Walk away with executive-level insights into how real-time data streaming is powering agentic AI use cases and how your organization can accelerate delivery of new data-driven capabilities with Confluent.
Greg Murphy / Paresh Oswal / Seth Catalli
Change Data Capture at Scale: Insights from Slack’s Streaming Pipeline
Slack was burning cash on batch data replication, with full-table restores causing multi-day latency. To slash both costs and lag, we overhauled our data infrastructure—replacing batch jobs with Change Data Capture (CDC) streams powered by Debezium, Vitess, and Kafka Connect. We scaled to thousands of shards and streamed petabytes of data. This talk focuses on the open source contributions we made to build scalable, maintainable, and reliable CDC infrastructure at Slack.

We'll show how we cut snapshotting time—from weeks to hours—for our largest table, half a petabyte in size and spread across hundreds of shards. You’ll learn how to apply our optimizations in Debezium, tune Kafka Connect configs, and maximize throughput. We’ll also cover how we tackled one of streaming’s most elusive challenges: detecting accurate time windows. By contributing a binlog event watermarking system to Vitess and Debezium, we made it possible to ensure correctness in a distributed system with variable lag. Finally, we’ll show you how to detect & prevent data loss in your own pipelines by applying the fixes we contributed to Kafka Connect and Debezium, which addressed subtle edge cases we uncovered in these systems.

Attendees will leave with practical techniques for deploying, scaling, and maintaining reliable CDC pipelines using open source tools—and a deeper understanding of how to avoid the common (and costly) pitfalls that can hinder the success of streaming data pipelines.
Tom Thornton
Unifying Kafka and Relational Databases for Event Streaming Applications
Kafka and relational databases have long been part of event-driven architectures and streaming applications. However, Kafka topics and database tables have historically been separate abstractions with independent storage and transaction mechanisms. Making them work together seamlessly can be challenging, especially because queuing has been viewed as an anti-pattern in a stock database.

This talk will describe how to close this gap by providing a customized queuing abstraction inside the database that can be accessed via both SQL and Kafka’s Java APIs. Since topics are directly supported by the database engine, applications can easily leverage ACID properties of local database transactions allowing exactly-once event processing. Patterns such as Transactional Outbox (writing a data value and sending an event) or any atomicity required across many discrete database and streaming operations can be supported out of the box. In addition, the full power of SQL queries can be used to view records in topics and also to join records in topics with rows in database tables.

In this talk we cover the synergy between Kafka's Java APIs, SQL, and the transactional capabilities of the Oracle Database. We describe the implementation, which uses a transactional event queue (TxEventQ) to implement a Kafka topic and a modified Kafka client that provides a single, unified JDBC connection to the database for event processing and traditional database access.
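For contrast, the snippet below shows the standard Kafka transactional producer API, which makes a batch of Kafka writes atomic but cannot span a relational database update; the talk's premise is that hosting the topic inside the database (TxEventQ) lets the send and the row change share one ACID transaction instead. Topic and broker names are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Standard Kafka transactional producer, shown for contrast: it makes a batch
// of Kafka writes atomic, but it cannot cover a JDBC update in the same unit.
public class KafkaOnlyTransaction {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "order-events-tx"); // enables transactions
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "order-42", "CREATED"));
                // A relational-table update here would NOT be covered by this transaction.
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```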
Nithin Thekkupadam Narayanan
From Queues to Intelligence: The Evolution of Stream Infrastructure for AI
AI workloads demand more than just scalable infrastructure. They require consistent, high-throughput, and reliable data movement that can support constantly evolving models and use cases. In this talk, we'll walk through how stream infrastructure has evolved inside a modern AI company: from simple durable queues to a sophisticated architecture powering various products and research. We'll cover lessons learned in scaling core systems like Kafka and Flink to support products like ChatGPT and Sora, and how thoughtful design, clear abstractions, and strong observability helped us stay reliable under massive growth. Whether you're focused on infrastructure or AI, this talk will highlight how streaming has become a foundational layer in the AI product stack.
Aravind Suresh
LazyLog: A New Log Abstraction for Low-Latency Applications
Streaming systems, at their core, are shared logs. These traditional shared logs enforce a strict, global order on all incoming data as it is written, which ensures strong consistency but adds significant latency during data ingestion. In practice, we've observed that many modern applications -- such as analytics pipelines and event-driven systems -- don't need this strict order immediately when data is ingested. Instead, the order only matters later, when the data is consumed. Based on this insight, we introduce LazyLog, a new approach to building shared logs. LazyLog delays the costly process of assigning a global order until just before the data is read, rather than at write time. This "lazy" approach significantly reduces write latency while still ensuring a consistent global view when needed. We built two systems that implement the LazyLog abstraction. These systems offer the same strong guarantees as traditional systems but with much lower write latencies. For teams building low-latency data pipelines or high-throughput distributed services, LazyLog offers a compelling alternative to conventional log-based systems. LazyLog is the result of academic research at the University of Illinois. The paper about LazyLog was published at SOSP, the flagship conference for systems research, winning a Best Paper Award at the conference.
Ram Alagappan
What the Spec?!: New Features in Apache Iceberg™ Table Format V3
Apache Iceberg™ made great advancements going from Table Format V1 to Table Format V2, introducing features like position deletes, advanced metrics, and cleaner metadata abstractions. But with Table Format V3 on the horizon, Iceberg users have even more to look forward to.

In this session, we’ll explore some of the exciting new user-facing features that V3 Iceberg is about to introduce and see how they’ll make working with Open Data Formats easier than ever! We’ll go through the high-level details of the new functionality that will be available in V3. Then we’ll dive deep into some of the most impactful features. You’ll learn what Variant types have to offer your semi-structured data, how Row Lineage can enhance CDC capabilities, and more. The community has come together to build yet another great release of the Iceberg spec, so attend and learn about all of the changes coming and how you can take advantage of them in your teams.
Russell Spitzer
Agentic AI Meets Kafka + Flink: Event-Driven Orchestration for Multi-Agent Systems
The rise of agentic AI systems is reshaping how intelligent applications are architected—introducing new levels of autonomy, collaboration, and complexity. As protocols like Agent-to-Agent (A2A) and Model Context Protocol (MCP) become foundational for multi-agent orchestration, the need for robust, scalable, and event-driven infrastructure becomes mission-critical. In this session, we’ll explore how Apache Kafka and Apache Flink serve as the backbone for enabling multi-agent systems to operate reliably and responsively in high-throughput environments. From maintaining contextual memory to routing asynchronous and synchronous requests, we’ll break down the architectural patterns that support real-time, protocol-compliant agent communication at scale. You’ll learn how to stream and process multimodal data - from gRPC to REST to JSON payloads - across distributed agent workflows while enforcing data integrity, managing quotas, and maintaining full observability. We’ll also cover how to apply stateful stream processing and fine-grained filtering to ensure agents always act on timely, relevant, and high-quality information.

Key takeaways:
- How to integrate Kafka and Flink with agentic AI frameworks using A2A and MCP
- Designing asynchronous/synchronous agent workflows with low-latency pipelines
- Techniques for streaming multimodal data between AI agents and services
- Enabling quota enforcement, usage tracking, and cost visibility with Kafka/Flink
- Real-world lessons from building distributed, multi-agent AI systems in production

If you're working on next-gen AI systems that require context awareness, memory, coordination, and streaming intelligence - this session will show you how to make it real.
Israel Ekpo / Devanshi Thakar (GPS)
An ounce of prevention is worth a pound of cure - Fix data clustering in streaming write to Iceberg
Apache Flink is commonly used to ingest continuous streams of data into Apache Iceberg tables. But it lacks the ability to organize data at write time, which can lead to small files and poor data clustering for many use cases. Regular table maintenance, such as compaction and sorting, can help to remediate the problems. But prevention is usually cheaper than remediation. In this talk, we will present a solution that can prevent those problems during streaming ingestion. Range distribution (sorting) is a common technique for data clustering, and many batch engines support it when writing to Iceberg. We will describe the range partitioner that was contributed to the Flink Iceberg sink (released in Iceberg 1.7 in late 2024). We will deep dive into how to handle the challenges of unbounded streams, organically evolving traffic patterns, low-cardinality and high-cardinality sort columns, and rescaling writer parallelism. By the end of the session, you will understand the design choices and tradeoffs, and why the approach is applicable to a broad range of streaming ingestion use cases.
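A minimal sketch of what enabling the feature can look like with the Iceberg Flink sink (1.7+), assuming a Hadoop-catalog table path and a parallelism chosen purely for illustration:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.DistributionMode;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

// Minimal sketch: enable range distribution so rows are clustered according to
// the table's sort order at write time. Paths and parallelism are placeholders.
public class RangeDistributedIngest {

    public static void appendWithRangeDistribution(DataStream<RowData> rows) {
        TableLoader tableLoader =
                TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/events");

        FlinkSink.forRowData(rows)
                .tableLoader(tableLoader)
                // Range-partition records across writers using the table's sort order,
                // which improves clustering and reduces small files.
                .distributionMode(DistributionMode.RANGE)
                .writeParallelism(8)
                .append();
    }
}
```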
Steven Wu
Agents Running A Data Mesh
The exponential growth of data demands a paradigm shift in how we discover, create, and evolve data products. This presentation introduces a novel Agentic Data Mesh architecture where AI agents take on proactive roles in the data product development lifecycle. Unlike traditional approaches, our agents don't just process data; they think about it.

We envision a system where specialized AI agents, empowered by real-time streaming data from Confluent, proactively identify opportunities for new data products based on business needs, data patterns, and existing data assets. These "Discovery Agents" will propose data product definitions to human subject matter experts for approval, acting as intelligent co-creators.

Upon approval, "Creation Agents" will leverage Confluent Tableflow to seamlessly transform Kafka topics into managed Apache Iceberg tables, ensuring schema evolution, ACID compliance, and time-travel capabilities. This automated creation extends to higher-order data products, where agents autonomously combine, refine, and process existing topics with Apache Flink, continually enriching the data landscape.

Furthermore, "Analysis Agents" will exploit the robust capabilities of Iceberg tables, performing complex analytical queries and identifying new insights that trigger the creation of even more refined data products. This iterative, agent-driven feedback loop creates an invaluable, self-optimizing data product development lifecycle, minimizing manual intervention and accelerating time-to-insight.

Attendees will learn:
- The architecture of an agentic data mesh, integrating AI with Confluent's streaming platform and Apache Iceberg.
- How AI agents can autonomously propose, create, and refine data products.
- Best practices for leveraging Confluent Tableflow for seamless Kafka-to-Iceberg integration in an agentic system.
- Strategies for establishing a continuous, self-improving data product development lifecycle.
- Real-world implications and potential impact on data governance, data quality, and business agility.
Blake Shaw
Bite Size Topologies: Learning Kafka Streams Concepts One Topology at a Time
Event streaming with Kafka Streams is powerful, but can feel overwhelming to understand and implement. Breaking down advanced concepts into smaller, single-purpose topologies makes learning more approachable. Kafka Streams concepts will be introduced with an interactive web application that allows you to visualize input topics, output topics, changelog topics, state stores, and more. What happens when state store caching is disabled? What if topology optimization is enabled? Or what if stream time isn't advanced? These questions will be easily explored by visualizing the topology and Kafka Streams configurations.

This interactive tutorial's real-time events are generated by actual data on your laptop, including running processes, thread details, windows, services, and user sessions. Moving a window on your laptop can trigger many examples, allowing you to see how the topology handles them.

The audience will select which topologies to cover in categories of: flow, joins, windowing, advanced state storage usage, and more.

Join me on this journey of learning Kafka Streams. You'll deepen your understanding of Kafka Streams concepts and gain access to tools that let you explore advanced concepts independently. All examples and visualization will be available in an open-source project.
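As a flavor of the "bite size" approach, here is one self-contained, single-purpose topology (with assumed topic names) that counts events per key, backed by a named state store, and prints its own topology description:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

// A single-purpose topology with assumed topic names: count events per process
// name, backed by a named state store (and its changelog topic).
public class ProcessCountTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("process-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("process-counts"))
               .toStream()
               .to("process-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        Topology topology = builder.build();
        System.out.println(topology.describe()); // inspect the topology structure

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "bite-size-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(topology, props).start();
    }
}
```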
Neil Buesing
Deep Dive into Apache Flink 2.1: The Key Features in SQL & AI Integration
Flink 2.1 is the first release following the major Flink 2.0 release, introducing significant advancements in SQL and AI integration, feature enhancements, and performance optimization. This session will highlight the following features:
1. Seamless integration of Flink SQL with AI models, exploring how to accomplish real-time AI analysis with Flink.
2. Flink SQL support for the Variant type, improving the efficiency of real-time analysis of semi-structured data in the Lakehouse.
3. How Flink SQL addresses the performance bottleneck of multi-stream join cases, including the introduction of various streaming optimization algorithms, such as delta join and multi-way join, ensuring efficient and scalable stream processing for modern data pipelines.
4. Flink's integration with the Lance AI format, which gives Flink the ability to handle multimodal data and opens a new chapter for AI workloads.

We hope that attendees will take away something from this session and let Apache Flink help you grow your business!
Ron Liu
Sizing, Benchmarking and Performance Tuning Apache Flink Clusters
A common question when adopting Apache Flink is about sizing the workload: How many CPUs and how much memory will Flink require for a particular use case? What throughput and latency can you expect given your hardware?

We’ll kick off this talk discussing why these questions are extremely difficult to answer for a generic stream processing framework like Flink. But we won’t stop there. The best approach to answer sizing questions is to benchmark your Flink workload. We will present how we’ve set up a Flink SQL-based benchmarking environment and some benchmarking results, so that attendees can correlate our results with their workloads to approximate their resource requirements.

Naturally, when benchmarking, the topic of performance tuning comes up: Are you optimally using the allocated resources? How do you identify performance bottlenecks? What are the most common performance issues, and how do you resolve them? In our case, a few configuration changes improved the throughput from 230 MB/s to over 3,200 MB/s. How many CPU cores are needed for that in Flink? Attend the talk to find out; it's less than you would expect.

This talk is for both Flink beginners wanting to get an idea about Flink’s performance and operational behavior, as well as for advanced users looking for best practices to improve performance and efficiency.
Robert Metzger
More than query: Morel, SQL and the evolution of data languages
What is the difference between a query language and a general-purpose programming language? Can SQL be extended to support streaming, incremental computation, data engineering, and general-purpose programming? How well does SQL fit into a modern software engineering workflow, with Git-based version control, CI, refactoring, and AI-assisted coding?

These are all questions that drove the creation of Morel. Morel is a new functional programming language that embeds relational algebra, so it is as powerful as SQL. Morel's compiler, like that of any good SQL planner, generates scalable distributed programs, including federated SQL. But unlike SQL, Morel is Turing-complete, which means that you can solve the whole problem without leaving Morel.

This session will discuss the challenges and opportunities of query languages, especially for streaming and data engineering tasks, and provide a gentle introduction to the Morel language. It is presented by Morel's creator, Julian Hyde, who created Apache Calcite and also pioneered streaming SQL.
Julian Hyde
The Future of Agentic AI is Event-Driven: How to Build Streaming Agents on Apache Flink
At their core, AI agents are microservices with a brain. They're powered by large language models (LLMs) and are independent, specialized, and designed to operate autonomously.

But agents need more than LLMs to scale -- they need real-time data access, context, and the ability to collaborate across tools, services, and even other agents. As timely data becomes crucial for modern AI systems, agents must operate within distributed, event-driven environments. This session focuses on how to bridge the gap between streaming infrastructure and agentic architectures, by enabling developers to build, test, and operate agents natively on Flink. Through architecture diagrams, use cases and a demo, we'll show practical steps for getting started with streaming agents to power new automation workflows.
Sean Falconer / Mayank Juneja
Quiet Failures, Loud Consequences: Streaming ML Drift Detection in Practice
A machine learning model in production is like a ship sailing blind: everything looks fine until it slams into a reef. And by then, it's too late. This phenomenon, known as concept and model drift, is especially dangerous in real-time systems where decisions happen in milliseconds and rollback is usually not an option. If not detected early, drift doesn’t just break your models — it misprices loans, misses fraud, and even risks lives.

This talk distills cutting-edge research and real production lessons into practical tools that can be applied today, even if your models are already in the wild. Based on ongoing PhD research and real-world implementations, we’ll walk through the following real-life questions:
- How drift manifests in event-driven ML systems — and why traditional batch monitoring fails.
- Common algorithms for drift detection (e.g., DDM, EDDM, ADWIN, Page-Hinkley) — and how to benchmark them in streaming environments.
- An architecture for integrating drift-aware intelligence into Flink pipelines, with hooks for alerting, model retraining, or failover strategies.
- Lessons from production use cases, including trade-offs in detection latency, false positives, and system overhead.

Whether you're deploying ML models into dynamic data streams or just planning your streaming AI strategy, you'll leave with a blueprint for building drift-resilient ML pipelines — plus hands-on knowledge to detect, benchmark, and respond to drift before it becomes failure.
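To ground one of the named detectors, here is a minimal, self-contained Page-Hinkley implementation; the delta and lambda thresholds and the simulated error shift are illustrative assumptions, not recommended settings.

```java
import java.util.Random;

// A minimal Page-Hinkley drift detector, one of the algorithms named above.
// Production use requires tuning for detection latency vs. false positives.
public class PageHinkleyDetector {
    private final double delta;   // tolerance for small fluctuations
    private final double lambda;  // alarm threshold
    private long n = 0;
    private double mean = 0.0;
    private double cumulative = 0.0;
    private double minCumulative = Double.MAX_VALUE;

    public PageHinkleyDetector(double delta, double lambda) {
        this.delta = delta;
        this.lambda = lambda;
    }

    /** Feed one observation (e.g., a model error); returns true if drift is detected. */
    public boolean update(double x) {
        n++;
        mean += (x - mean) / n;             // incremental mean
        cumulative += x - mean - delta;     // Page-Hinkley cumulative statistic
        minCumulative = Math.min(minCumulative, cumulative);
        return (cumulative - minCumulative) > lambda;
    }

    public static void main(String[] args) {
        PageHinkleyDetector detector = new PageHinkleyDetector(0.005, 50);
        Random rnd = new Random(42);
        for (int i = 0; i < 2000; i++) {
            // Simulate a shift in the error distribution halfway through the stream.
            double error = (i < 1000 ? 0.1 : 0.4) + rnd.nextGaussian() * 0.05;
            if (detector.update(error)) {
                System.out.println("Drift detected at observation " + i);
                break;
            }
        }
    }
}
```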
Dominique Ronde
The evolution of Notion’s event logging stack
Notion's client and server applications generate billions of events daily. We track these events to understand how our customers are using the product and what kind of performance they experience. Some events also contribute to customer-facing product features. This talk covers the Event Trail platform that enables us to process and route these events in a scalable manner.

Event logging at Notion was initially built on third-party services with connectors to our Snowflake data lake. This lacked the scalability and flexibility that we required as our product grew, and so we built Event Trail. Event Trail receives events from the application, augments their content, and then directs them to one or more destinations based on their type. Routing is defined in code with dynamic overrides and honoring of cookie permissions. The most common destinations are Apache Kafka topics powered by Confluent.

The data warehouse ingestion pipelines read events from Kafka and write them to Snowflake. They were originally based on Apache Flink and S3 ingestion but have evolved to use Snowpipe Streaming connectors for easier maintenance and scalability. The real-time analytics pipelines use events to power user-facing features like page counters and enterprise workspace analytics. These features have also evolved, from batch results served via DynamoDB and Redis to online calculations via Apache Pinot.
Adam Hudson
Breaking Boundaries: Confluent Migration for Every Stack
In this session, we will cover the current pain points our clients have experienced and the need for migration to Confluent Cloud. We will highlight Infosys' experience in migrating clients from open-source platforms and other Kafka distributions, as well as message brokers, to Confluent Cloud.

The migration approach will include:
- Setting up Kafka clusters in Confluent Cloud using automation
- Replicating topics and data using Confluent's recommended methods
- Seamlessly migrating clients from existing clusters to Confluent Cloud
Prakash Rajbhoj
Evolving the Data Supply Chain: Powering Real-Time Analytics & GenAI with Flink, Iceberg, and Trino
In the rapidly evolving landscape of data-driven enterprises, the ability to harness and process vast amounts of information in real-time is paramount. This talk will revisit the concept of the Data Supply Chain, a framework that enables AI at an enterprise scale, and explore cutting-edge technologies that are transforming data streaming and processing capabilities.

Building on insights from last year's presentation (https://www.youtube.com/watch?v=Zp86b_eaW8g), we will delve into the use of Apache Flink for stream processing, providing a foundation for real-time decision-making and AI applications. We will introduce Tableflow, a tool for Kafka-to-Iceberg materialization. This integration enhances data accessibility, ensuring that data is readily available for analytics and AI workloads.

The talk will also highlight the role of Starburst Trino in enabling real-time analytics and agentic workloads over Iceberg tables. By leveraging Trino's powerful uniform data access layer (query engine), enterprises can perform complex analytics on large datasets with unprecedented speed and efficiency. This capability is crucial for organizations aiming to derive actionable insights and drive innovation through AI.

Join us as we explore these transformative technologies and their impact on the Data Supply Chain. Attendees will gain valuable insights into optimizing their data infrastructure to support AI initiatives and achieve enterprise-scale success. This session is ideal for data & AI strategists, data engineers, architects, and decision-makers looking to enhance their data streaming capabilities and unlock the full potential of AI in their organizations.
Dylan Gunther / Craig Albritton / Thomas Mahaffey
Diskless but with disks, Leaderless but with leaders: A KIP-1163 Deep Dive
KIP-1150: Diskless Topics promises to make Apache Kafka more cost effective and flexible than ever before, but how does it work? Where do the cost savings come from? Is it really Diskless? What about Leaderless? Why is the latency worse? This talk will walk through the design for the preferred implementation in KIP-1163: Diskless Core, and answer all of these questions.

A basic understanding of Apache Kafka is enough to attend this talk: we’ll review the architecture used for classic and tiered topics, and how data is produced and fetched. We'll discuss the limitations of this architecture in the context of modern hyperscaler cloud deployments, and where the costs become excessive. Then we’ll show how the basic components of Kafka are taken apart and reassembled to build the Diskless architecture. We’ll also discuss the major rejected alternatives, and compare KIP-1163 to similar KIPs working to solve the same problem. At the end of this session, you should feel confident talking to stakeholders and community members about this amazing upcoming feature!
Greg Harris
From Pawns to Pipelines: Learning Flink Through the Mind of a Chess Player
Streaming systems and chess have more in common than you think: both revolve around sequences, state, timing, and pattern recognition. This talk introduces Apache Flink through the lens of chess, using the familiar game to make real-time concepts more accessible to engineers and data practitioners of all levels. We’ll explore how core Flink abstractions map naturally to chess ideas:
- Streams as sequences of moves
- Tables as the evolving board state
- Windows as segments of the game (openings, tactics, endgames)
- CEP (Complex Event Processing) as spotting tactics and combinations
- ...and more

Each concept is reinforced with practical SQL examples and real-world analogies from production Flink use cases, like customer behavior modeling and fraud detection. This talk is designed for beginners in Flink and data streaming who want to build a solid foundation, but it also offers a fresh educational approach for instructors, solution architects, and engineers who explain Flink to others. By linking abstract streaming mechanics to a concrete mental model, we aim to make Flink both intuitive and memorable. Whether your Flink skills are at 800 Elo or 2000, or you're just learning how a knight moves, you'll leave with a stronger intuition for how stream processing works.
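A toy sketch of the analogy (assuming the Flink 1.x DataStream API): a stream of moves keyed by game, with a tumbling window standing in for a "segment of the game".

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Toy illustration: count moves per game in one-minute tumbling windows.
public class MovesPerGame {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("game1:e4", "game1:e5", "game2:d4", "game1:Nf3", "game2:Nf6")
           .map(new MapFunction<String, Tuple2<String, Long>>() {
               @Override
               public Tuple2<String, Long> map(String move) {
                   return Tuple2.of(move.split(":")[0], 1L); // (gameId, one move)
               }
           })
           .keyBy(t -> t.f0)                                              // the "board" we track
           .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))     // a segment of the game
           .sum(1)                                                        // moves per game per window
           .print();

        env.execute("moves-per-game");
    }
}
```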
Vish Srinivasan
Unpacking Serialization in Apache Kafka: Down the Rabbit Hole
Picture this: your Kafka application is humming along perfectly in development, but in production, throughput tanks and latency spikes. The culprit? That "simple" serialization choice you made without much thought. What seemed like a minor technical detail just became your biggest bottleneck.

Every Kafka record—whether flowing through KafkaProducer, KafkaConsumer, Streams, or Connect—must be converted to bytes over TCP connections. This serialization step occupies a tiny footprint in your code but wields outsized influence over your application's performance. For Kafka Streams stateful operations, this impact multiplies as records serialize and deserialize on every state store access.

You could grab a serializer that ships with Kafka and call it done. But depending on your data structure and use patterns, the wrong choice can cost you critical performance. The right choice can transform your application from sluggish to lightning-fast.

This talk dives deep into serialization performance comparisons across different scenarios. We'll explore critical trade-offs: the governance and evolution benefits of Schema Registry versus the raw speed of high-performance serializers. You'll see real benchmarks, understand format internals, and learn exactly when to apply each approach.

Whether you're building low-latency trading systems or high-throughput data pipelines, you'll leave with concrete knowledge to optimize one of Kafka's most impactful—yet overlooked—components. Don't let serialization be your silent performance killer.
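To show the extension point in question, here is a minimal custom Serializer; a reflection-based JSON serializer like this is convenient, and is exactly the kind of choice whose cost benchmarks can quantify. The Jackson dependency and the wiring comment are assumptions for illustration.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serializer;

// Minimal example of the extension point: every record passes through a
// Serializer like this one on its way to the broker.
public class JsonSerializer<T> implements Serializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (Exception e) {
            throw new org.apache.kafka.common.errors.SerializationException(e);
        }
    }
}

// Wiring it into a producer (value class is up to you):
//   props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
```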
Bill Bejeck
From Tower of Babel to Babel Fish: Evolving Your Kafka Architecture With Schema Registry
You’ve conquered the basics – Kafka clusters are running, producers are producing, and consumers are consuming. Life is good...until your Python team needs to talk to your Spring Boot services, and suddenly, everyone’s speaking different languages. Like the biblical Tower of Babel, your elegant event-driven architecture crumbles under the weight of miscommunication.

What if there was a Babel Fish for your distributed systems? A way to let each service speak its native tongue while ensuring perfect understanding across your entire ecosystem?

This talk will explore how Schema Registry transforms from “that optional component you skipped” into the essential backbone of resilient, polyglot Kafka architectures. You’ll discover practical strategies for implementing data contracts that evolve without breaking, patterns for seamlessly integrating Schema Registry into your CI/CD pipelines, and real-world approaches for managing schema evolution without derailing your development velocity.

Whether scaling beyond your first language, preparing for a multi-team Kafka implementation, or recovering from your first production schema disaster, you’ll leave with concrete techniques to make your Kafka systems more resilient, flexible, and ready for Day 2 challenges.
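A minimal sketch of the "Babel Fish" in practice: a producer configured with a Schema Registry-aware Avro serializer so that consumers in any language deserialize against the same registered contract. The URLs and the choice to disable runtime schema registration are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

// Illustrative producer configuration using a Schema Registry-aware serializer.
// Endpoints are placeholders; registration policy depends on your CI/CD setup.
public class RegistryAwareProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");
        // Let the CI/CD pipeline own schema registration instead of registering at runtime.
        props.put("auto.register.schemas", "false");
        return props;
    }
}
```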
Viktor Gamov
Orchestrating a Successful Kafka Migration
Migrating a mission-critical Kafka ecosystem is no small feat—especially when that ecosystem powers the backbone of Nordstrom’s digital operations. Supporting over 220 engineering teams, thousands of applications, and an expansive landscape of Kafka topics and streams, this migration to Confluent Cloud was a monumental undertaking. In this talk, we’ll dive deep into the strategies, tools, and lessons learned from Nordstrom’s Kafka migration journey. From assessing readiness to orchestrating data replication and coordinating across hundreds of teams, we’ll break down the critical components that made this migration a success. While Kafka was the star of the show, we also leveraged tools like Temporal to streamline and automate key workflows, helping us manage long-running processes and reduce operational risks. Temporal played an essential role in ensuring seamless coordination, but the real focus is on how we handled Kafka-specific challenges—such as maintaining data integrity, minimizing downtime, and ensuring business continuity.

Key takeaways include:
- How to plan and execute a large-scale Kafka migration.
- Strategies for ensuring zero data loss and uninterrupted streaming.
- Lessons from coordinating diverse teams and applications in a high-stakes migration.

Whether you’re planning your own Kafka migration or simply want insights into managing large-scale streaming infrastructure, this talk will provide practical knowledge and real-world examples you can apply to your own systems.
Jack Burns
FlinkAI: Building a Real-Time LLM Knowledge Engine for Apache Flink... with Flink!
How do you keep your developers effective when your internal Flink practices diverge from the open-source community? At Yahoo, we faced this challenge by building FlinkAI, a smart knowledge system that bridges the gap between our internal expertise and the global Apache Flink community. In this session, we'll show you how we use Apache Flink itself to power a real-time streaming pipeline that ingests, processes, and understands Flink knowledge. FlinkAI consumes everything from our internal deployment guides (EKS, mTLS, Okta) to external community firehoses like mailing lists, Jiras, and commits. Using OpenAI, this data is transformed into semantic embeddings and stored in a vector database for lightning-fast natural language search. The best part? It’s integrated directly into the Flink Web UI. FlinkAI automatically analyzes exceptions as they happen and suggests solutions in an embedded chat, turning the UI into an active troubleshooting assistant.

Come to this session to learn:
- A novel architecture for a streaming-first, LLM-powered knowledge system.
- How to leverage Flink to build powerful internal tooling for developers.
- Practical lessons on integrating LLMs and vector databases in a real-time context.
Purshotam Shah
Tuning the Iceberg: Practical Strategies for Optimizing Tables and Queries
Apache Iceberg unlocks scalable, open table formats for the modern data lake—but performance doesn’t come by default. In this talk, we’ll dive into the hands-on techniques and architectural patterns that ensure your Iceberg tables and queries stay lean and lightning-fast. From data compaction and clustering to compression strategies and caching layers, we’ll explore how each lever impacts performance, cost, and query latency. You’ll also learn how modern engines like Dremio optimize queries behind the scenes and how to align your table design with those optimizations. Whether you’re running Iceberg in the cloud or on-prem, this session will give you a practical performance toolkit to get the most from your lakehouse architecture.
Alex Merced
Press Play on Data: Netflix's Journey from Streams to Gaming Insights
Netflix's Data Mesh platform serves as our foundation for stream processing, but recent innovations have dramatically expanded its capabilities and accessibility. This presentation explores how these advancements in Data Mesh enabled the successful development of our Games Analytics Platform, as Netflix's games portfolio expanded to 100+ games across TV, mobile, and web platforms.

We'll first trace Data Mesh's evolution from a simple data movement platform to a comprehensive real-time processing ecosystem. Attendees will learn how the platform powers business-critical applications while maintaining security and scalability. A key advancement we'll highlight is the introduction of Streaming SQL, which replaced complex low-level programming with an intuitive, declarative approach. This evolution, alongside robust infrastructure-as-code practices, has democratized streaming data access across Netflix, enabling domain experts to build sophisticated data products without specialized stream processing knowledge.

The second part of our presentation showcases these innovations in action through the Games Analytics Platform case study. As Netflix ventured into games, our Games Data team leveraged Data Mesh to build a robust data processing layer that helps scale their data teams to meet the diverse data needs of game stakeholders. We’ll demonstrate how the SQL Processor’s user-friendly features coupled with Infrastructure as Code capabilities within Data Mesh enabled Netflix Games to scale their data and analytics ecosystem with minimal technical overhead. Join us to discover how established data infrastructure can evolve to meet new business challenges, the architectural decisions that facilitated this evolution, and how the synergy between platform innovation and practical application resulted in a scalable data ecosystem supporting Netflix's growing gaming portfolio.
Sujay Jain / Michael Cuthbert
Event Driven Views With Ingest-Materialize-Index Stream Topology
At Indeed, we help people get jobs. Our team supports this mission by ingesting data about employers’ hiring needs, enabling them to manage this data, and transforming it into searchable job advertisements—commonly known as job posts. Creating job posts is a complex and I/O-bound process, requiring enrichment from multiple bounded contexts. Adding to this complexity, many downstream systems must be notified in real-time when any part of a job post’s data changes.

To meet these demands, we implemented a system that produces and maintains an Event Driven View (EDV)—a materialized, denormalized representation of job posts that stays up-to-date as changes occur across the business. This view is powered by a novel Ingest–Materialize–Index (IMI) stream topology, which enables us to scale processing while preserving strong observability and reliability guarantees.

This talk will explore how we build and operate EDVs using the IMI architecture:
- We’ll break down each IMI stage and show how clean separation of concerns enables performance and observability.
- We’ll show how we use micro-batching, structured concurrency, and I/O pipelining to manage I/O-bound enrichment at scale.
- We’ll share strategies for recycling failed view materializations through rate-limited retry streams.
- We’ll cover how we incorporate Change Data Capture (CDC) to reliably notify our downstream clients about changes in our persistent EDVs.
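As a generic illustration of the micro-batching and I/O pipelining idea (not Indeed's implementation), the sketch below polls a batch from Kafka, fans out the I/O-bound enrichment calls concurrently, and commits once the whole batch has materialized; the pool size and poll interval are assumptions.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Generic micro-batching + I/O pipelining sketch: enrich a whole polled batch
// concurrently, then commit the batch as one unit.
public class MicroBatchEnricher {
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    public void run(KafkaConsumer<String, String> consumer, Enricher enricher) {
        while (true) {
            var batch = consumer.poll(Duration.ofMillis(500));      // one micro-batch
            List<CompletableFuture<String>> inFlight = new ArrayList<>();
            for (ConsumerRecord<String, String> record : batch) {
                inFlight.add(CompletableFuture.supplyAsync(
                        () -> enricher.enrich(record.value()), pool));
            }
            // Pipeline the I/O: all enrichment calls overlap, then the batch is materialized.
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
            consumer.commitSync();                                  // batch-level checkpoint
        }
    }

    public interface Enricher {
        String enrich(String jobPostEvent);
    }
}
```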
Sage Pierce
Empowering the Disconnected Edge: Shifting Far Left with Predictive Analytics for Naval Ships
A Navy ship is essentially a large edge node with unique complexities…let me explain. While you may not think of a ship as an edge node due to its size, it does share similar use cases that are seen on typical edge-based deployments. Sensor data is collected and needs to be aggregated and disseminated to multiple environments including shore and cloud sites. Sharing data in a denied, disrupted, intermittent, and limited (DDIL) environment presents a significant challenge. A Navy ship, when deployed, can also spend 6+ months out at sea before returning to port. For predictive analytics at the disconnected edge, a key consideration is how to manage software updates, including updates to the analytical models themselves.

In this session, we will explore how Confluent (Kafka) and Databricks are solving the problems with predictive analytics at the edge and bridging the operational and analytical domains. We will demonstrate how Cluster Linking can be leveraged with DDIL and smart edge processing by prioritizing topics when bandwidth is restricted. We will use logistics data to develop analytics using Delta Live Tables and mlflow that can be used for predicting failures in equipment on the ship. And finally, how the analytics can be deployed to the ship, while at sea, for real-time reporting using Apache Flink.

Attendees will leave with an understanding of the complexities of edge-based analytics and a blueprint for setting up a pipeline to overcome those challenges in real-world applications.
Michael Peacock / Andrew Hahn
Future of Streaming: Emerging trends for event driven architectures
JPMC is undertaking a significant data transformation by implementing a next-generation data streaming platform, moving beyond traditional mainframe dependencies. This initiative addresses several challenges, including the expense of mainframe queries, excessive data duplication, silos, high data gravity within the mainframe, and a lack of real-time capabilities that have prevented effective data leverage for critical initiatives like Agentic AI.

The strategy involves establishing Kafka as the authoritative copy of data, which facilitates the creation of a centralized source of truth. This approach enables the development of real-time data products that aim for high quality, availability, and global accessibility. By embedding best practices from the outset, such as schema management, Role-Based Access Control (RBAC), and robust metadata, JPMC seeks to ensure that its data is of high quality, secure, and easily discoverable across the enterprise.

This foundation is crucial for modernization, supporting Agentic AI and stream operations by providing a reliable and high-quality data backbone. The ability to effectively deliver high-quality, discoverable data with contracts and SLAs is seen as the pivot around which future modernization will occur, moving towards a more automated, non-manual operating environment. This strategic investment allows JPMC to enhance its capabilities and prepare for new innovations.
Matthew Walker
The Curious Case of Streaming Windows
There is basically no stream processing without windowing, and Kafka Streams provides a rich set of built-in windows for aggregations and joins. However, it is often unclear to developers how the different window types work and, even more important, for which use cases a specific window is a good fit. In particular, sliding windows are often a mystery, and are easily confused with hopping windows.

In this talk, we will explain the different window types of Kafka Streams, give guidance on when to use which window, and unriddle the curious case of the sliding window. Furthermore, we give a sneak preview of the new "BatchWindow" type, recently proposed via KIP-1127, which unlocks new use cases that were hard to cover in the past. Join this session to become a windowing expert and set yourself up for success with Kafka Streams.
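For reference, here are the two window types most often confused, defined side by side with the Kafka Streams API (durations are arbitrary examples); a grouped stream would use either one via windowedBy(...) before aggregating.

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.SlidingWindows;
import org.apache.kafka.streams.kstream.TimeWindows;

// A hopping window is a fixed-size window advanced by a fixed hop; a sliding
// window is defined by the maximum time difference between grouped records.
public class WindowDefinitions {

    // Hopping: 5-minute windows, starting every 1 minute (windows overlap).
    static final TimeWindows HOPPING =
            TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))
                       .advanceBy(Duration.ofMinutes(1));

    // Sliding: groups records whose timestamps are at most 5 minutes apart;
    // windows are created per record, aligned to the data rather than the clock.
    static final SlidingWindows SLIDING =
            SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5));
}
```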
Matthias J Sax
Kroxylicious: Taking a bite out of the Kafka Protocol
As Apache Kafka usage continues to grow, it gets deployed in increasingly sensitive and regulated environments. At the same time, data engineering teams have more and more requirements to satisfy the needs of businesses to support AIs and provide real-time business intelligence. Unfortunately, for historical or design reasons, Apache Kafka is not able to provide all the features everybody needs.

One solution gaining traction over the last couple of years is to proxy Apache Kafka. This session introduces Kroxylicious, an open source, Kafka protocol-aware transparent proxy (part of the Commonhaus foundation). Kroxylicious offers developers a standardised Filter API to allow them to customize the messages passing through the proxy, as well as a plug-in based extension mechanism to allow them to interact with remote resources, such as Key Management Systems or Schema registries. All this is completely invisible to clients and clusters and does not require updating them.

Out of the box, Kroxylicious provides customizable record encryption to ensure data at rest is safe even if you use a cloud provider. It also integrates with a schema registry so you can ensure that records sent to specific topics match the configured schemas. As it fully understands the Kafka protocol, it opens the possibility for building a wide range of features such as automatic cluster failover, offloading authentication, multitenancy, and more.

At the end of this talk, attendees will understand the core principles and out-of-the-box functionality of Kroxylicious. They will also know how to run and operate it, as well as how to incorporate custom business logic.
Sam Barker
Escape the Micro-Maze: Build Fast, Scalable Streaming Services with Apache Flink
So, you're building microservices, and if you're like me, you've probably found yourself wrestling with Kubernetes, trying to manage state, handle failures, and figure out scaling for each service. Someone inevitably says, "Just build it stateless!" and I always think, "I'd love to see that work seamlessly in the real world." I believe there's a more straightforward way to build fast, resilient user experiences.

In this talk, I want to share a somewhat radical idea for those of us tired of the traditional microservice shuffle: building our operational logic, and even entire microservices, directly in Apache Flink. I'm not just talking about data pipelines; I'm proposing we start "going operational with Flink," moving beyond its traditional analytical domain.

I'll dig into why I think Flink offers a distinct advantage for application development. First, Flink was born for state, and I'll show you how its robust state backends can simplify what's often a major headache in microservice architectures. Then, we'll look at how Flink's inherent fault tolerance and scaling mechanisms can apply to our application logic, not just data processing – meaning less ops and more dev for us. Finally, I'll discuss practical approaches for handling point-to-point calls, comprehensive state management, and general application development patterns within Flink. I've come to think of Flink as an application server, supercharged for streams and state.

Join me to see how Apache Flink can simplify our architectures, make our user experiences faster, and potentially let us bid farewell to some of those microservice complexities. And with a bit of help from Kafka Streams, we'll see it in action.
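A small sketch of what "operational logic in Flink" can look like, assuming the Flink 1.x API: per-key totals kept in checkpointed keyed state rather than an external store, with fault tolerance and rescaling handled by the framework.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch of operational logic living directly in Flink: per-customer order
// totals held in keyed state instead of an external database.
public class RunningOrderTotal extends KeyedProcessFunction<String, Double, String> {

    private transient ValueState<Double> total;

    @Override
    public void open(Configuration parameters) {
        total = getRuntimeContext().getState(
                new ValueStateDescriptor<>("order-total", Double.class));
    }

    @Override
    public void processElement(Double orderAmount, Context ctx, Collector<String> out)
            throws Exception {
        double current = total.value() == null ? 0.0 : total.value();
        current += orderAmount;
        total.update(current);               // checkpointed, rescalable state
        out.collect(ctx.getCurrentKey() + " total=" + current);
    }
}
```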
Ben Gamble
Scaling Streaming Computation at LinkedIn: A Multi-Year Journey with Apache Flink
At LinkedIn, stream processing is the foundation for delivering real-time features, metrics, and member experiences across products like Ads AI, Search, Notifications, and Premium. Over the past four years, we’ve built and evolved a fully managed stream processing platform based on Apache Flink to meet increasing demands for scale, state, and reliability.

This talk shares our journey from building a self-serve, Kubernetes-native Flink platform to supporting high-throughput, stateful applications with managed Flink SQL. Today, our platform powers thousands of mission-critical pipelines and enables developers to author and deploy jobs declaratively, while abstracting away operational complexity.

As workloads grew in complexity and state size, we tackled state management challenges head-on: optimizing checkpointing and recovery, evaluating state storage options, and navigating trade-offs in scalability, cost, and performance. We’ll walk through how we scaled stateful joins, onboarded high-QPS applications, and migrated from Samza and Couchbase to Flink SQL - achieving over 80% hardware cost savings.

Key highlights include:
- Building a self-serve Flink platform on Kubernetes with split deployment, monitoring, alerting, auto-scaling, and failure recovery
- Scaling Flink SQL: challenges and lessons from supporting large stateful jobs, including state storage choices, state garbage collection (GC) failures, and inefficient job sizing
- Diagnosing performance bottlenecks and building a resource estimation model for join-intensive Flink SQL pipelines
- Developing tooling for safe migrations, automating reconciliation and backfill workflows, and enabling end-to-end validation

We’ll share the lessons learned and platform investments that helped us scale Apache Flink from early experimentation to a robust, production-grade streaming engine. Whether you're building a Flink-based platform or migrating stateful pipelines at scale, this talk offers actionable insights from operating Flink in production.
Weiqing Yang
Beyond Documentation: AI Agents as Flink Debugging Partners
Operating over 1,000 Apache Flink applications at Stripe has taught us that even the most comprehensive documentation can't eliminate the cognitive load of debugging complex distributed systems. Less experienced Flink developers routinely juggle multiple tools—Flink UI, Prometheus metrics, Splunk logs—while cross-referencing extensive runbooks to diagnose failures. This operational overhead inspired us to explore an unconventional solution: integrating AI coding agents directly into our Flink platform.

In this talk, we'll share how we transformed Flink debugging from a multi-tool treasure hunt into an intelligent, conversational experience. Our integration enables AI agents to:
- Automatically fetch and correlate metrics
- Parse logs for relevant error patterns
- Navigate our extensive Flink documentation and runbooks
- Generate contextual debugging suggestions

This talk shares our implementation journey, quantitative improvements (x% faster diagnosis), and the critical human-in-the-loop patterns that ensure safety. You'll see real debugging sessions, learn how we chose the right model, and understand where it fails. We'll conclude with actionable insights for teams considering AI-assisted operations.
Pratyush Sharma / Seth Saperstein
Scaling the Past: Productionalizing the Flink History Server for Stream and Batch
Flink powers both streaming and batch data workflows. While Flink’s Web UI is useful for real-time monitoring, it falls short when streaming jobs terminate unexpectedly—losing convenient access to logs, metrics, and exceptions. This gap is even more critical for batch processing, which is where the Flink History Server comes in. The current state of the Flink History Server has many limitations that make it impractical for use. Notable functionality gaps include the local cache size being the primary limiting factor for the number of stored jobs, and the built-in log navigation being rudimentary. In this presentation, I'll share how these issues were addressed by a) improving scalability to support hundreds of jobs for both streaming and batch Flink workflows, b) introducing pluggable storage backends, and c) enabling pluggable log linking handlers. These enhancements significantly improve the Flink History Server's capability to support production workloads. Join us to learn how we’re building a more robust and versatile future for Flink's past. More details can be found in FLIP-505. https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
Allison Chang
Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities
As data architectures evolve to meet the demands of real-time GenAI applications, organizations increasingly need systems that unify streaming and batch processing while maintaining compatibility with existing tools. The Ursa Engine is a Kafka-API-compatible data streaming engine built on lakehouse formats (Iceberg and Delta Lake). Designed to integrate seamlessly with data lakehouse architectures, Ursa extends your lakehouse by enabling streaming ingestion, transformation, and processing through a Kafka-compatible interface.

In this session, we will explore how the Ursa Engine augments your existing lakehouses with Kafka-compatible capabilities. Attendees will gain insights into the Ursa Engine architecture and real-world use cases. Whether you're modernizing legacy systems or building cutting-edge AI-driven applications, discover how Ursa can help you unlock the full potential of your data.
Gaurav Saxena / David Kjerrumgaard
From “Where’s My Money?” to “Here’s Your Bill”: Demystifying Kafka Chargebacks and Showbacks
Have you ever wondered how the money you spend on those Kafka clusters is being utilized? Or how much end users should be paying for those awesome use cases that they run in production at scale without worrying about downtime and resiliency? How do you charge the one person who requested those 1,000-partition topics, or the one who owns only 3 of your 1,200 topics but uses about 70% of the cluster's available network throughput? If you have ever wondered about any of these questions, this talk is for you.

In this talk, we will deep dive into ways to dissect your Kafka bills and attribute them to the end users, business teams, and application teams that depend on these Kafka clusters. I will help you understand the fundamentals of how to approach chargebacks/showbacks for Kafka and show you how deep the rabbit hole goes. Using open source tooling and an example, we’ll discuss:
* Techniques to define a core identity – an mTLS certificate, a SASL user, or a business unit? Which one is it, and which one should it be?
* How to envision the cost split – should it be spread evenly, or should there be usage-based differentiation for things like network over-utilization? Noisy neighbour, anyone?
* Chargeback – what should the final output of your process be, and how should it be delivered? Is an Excel sheet enough, or do you want a dashboard that keeps updating itself automagically?

By the end of this talk, you will understand the fundamentals needed to either build out your own cost analysis for Kafka or use the tool to just say: “Here’s your Bill.”
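As an illustration of the usage-weighted cost split the talk describes, here is a minimal, hypothetical Python sketch (not the speaker's tool): it attributes a cluster's monthly bill to tenants in proportion to a blend of their partition count and network throughput. The weights and field names are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class TenantUsage:
    name: str
    partitions: int        # partitions owned by the tenant
    bytes_per_sec: float   # average produce + consume throughput

def chargeback(monthly_bill: float, usage: list[TenantUsage],
               partition_weight: float = 0.3, network_weight: float = 0.7) -> dict[str, float]:
    """Split the bill by a weighted blend of partition count and network usage."""
    total_partitions = sum(u.partitions for u in usage) or 1
    total_bytes = sum(u.bytes_per_sec for u in usage) or 1
    bill = {}
    for u in usage:
        share = (partition_weight * u.partitions / total_partitions
                 + network_weight * u.bytes_per_sec / total_bytes)
        bill[u.name] = round(monthly_bill * share, 2)
    return bill

# Example: the 1,000-partition team vs. the 3-topic team hogging network throughput.
tenants = [
    TenantUsage("ads-team", partitions=1000, bytes_per_sec=5e6),
    TenantUsage("search-team", partitions=30, bytes_per_sec=70e6),
    TenantUsage("billing-team", partitions=170, bytes_per_sec=25e6),
]
print(chargeback(10_000.0, tenants))
```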
Abhishek Walia
Robinhood’s Use of WarpStream for Logging
As applications scale, so do the cost and complexity of logging. Robinhood has historically used Apache Kafka extensively for its logging needs, but a new technology has emerged. In this session, we'll show developers how to build high-performance, cost-efficient logging pipelines using WarpStream, Confluent's serverless, Kafka-compatible streaming platform. The talk will largely focus on:
- A quick introduction to WarpStream as a technology and its important features.
- Why Robinhood decided to invest in WarpStream for logging workloads.
- The advantages and tradeoffs of moving from Kafka to WarpStream, focusing on performance, reliability, and cost.
- The Humio migration process from Kafka to WarpStream to move critical logging workloads while minimizing logging disruptions.

If your organization currently runs Kafka to power logging workloads and is interested in exploring WarpStream as a solution, attend this talk to see how Robinhood did it and whether there are lessons you can apply in your own organization.
Ethan Chen / Renan Rueda
A Deep Dive into Kafka Consumer Rebalance Protocols: Mechanisms and Migration Process Insights
KIP-848 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) introduces a new consumer rebalance protocol to Kafka that differs significantly from the existing one. This session guides attendees through a detailed comparison of the existing and new consumer rebalance protocols and a thorough examination of the migration mechanisms involved in transitioning between them.

We offer a comprehensive overview of the fundamentals underlying the classic rebalance protocol with its different assignment strategies, as well as the newly introduced incremental rebalance protocol, including the intricacies of group coordination and the partition assignment strategies of each protocol.

Following the overview, we delve deeply into the mechanisms that drive the migration between the old and new rebalance protocols. This includes an exploration of stop-the-world offline migration methodologies and the more sophisticated online migration techniques that support non-empty group conversions. Detailed case studies illustrate the steps of upgrading and downgrading consumer groups, including handling the intermediate states where the group coordinator manages membership statuses across different protocols.

Through these insights, attendees will gain an understanding of the intricacies involved in seamlessly transitioning consumer groups and the improvements brought by the new rebalance protocol. This talk will also be of particular interest to distributed systems professionals who want to know more about the internals of Apache Kafka.
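For orientation, here is a minimal sketch of what opting into the KIP-848 protocol can look like from a client, assuming a client library and broker version that already support it; in librdkafka-based clients the `group.protocol` property selects between `classic` and `consumer`, and with the new protocol server-side assignors are chosen via `group.remote.assignor` rather than `partition.assignment.strategy`. Topic name and group ID below are placeholders.

```python
from confluent_kafka import Consumer

conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-processing",
    # Opt into the KIP-848 rebalance protocol (requires new-enough client and brokers);
    # "classic" keeps the original client-side eager/cooperative protocol.
    "group.protocol": "consumer",
    "auto.offset.reset": "earliest",
}

consumer = Consumer(conf)
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        print(f"{msg.topic()}[{msg.partition()}]@{msg.offset()}: {msg.value()}")
finally:
    consumer.close()
```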
Dongnuo Lyu / David Jacot
Streaming Meets Governance: Building AI-Ready Tables With Confluent Tableflow and Unity Catalog
Learn how Databricks and Confluent are simplifying the path from real-time data to governed, analytics- and AI-ready tables. This session will cover how Confluent Tableflow automatically materializes Kafka topics into Delta tables and registers them with Unity Catalog — eliminating the need for custom streaming pipelines. We’ll walk through how this integration helps data engineers reduce ingestion complexity, enforce data governance and make real-time data immediately usable for analytics and AI.
Jason Pohl / Kasun Indrasiri
GC, JIT and Warmup: The JVM’s Role in Flink at Scale
The JVM plays a critical but often overlooked role in the performance of Apache Flink applications. In this talk, we’ll examine how core JVM mechanisms - garbage collection (GC), Just-In-Time (JIT) compilation, and warmup behavior - can introduce latency, affect throughput, and lead to unpredictable performance in long-running Flink jobs.

We’ll break down the impact of GC algorithms on streaming workloads, explore how JIT optimizations can cause performance shifts during job execution, and explain why the warmup phase matters and what can be done about it. We'll correlate performance charts with GC and compilation logs, leaving attendees with a deeper understanding of how the JVM interacts with Flink's runtime.
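To produce the kind of GC and compilation logs the talk correlates, one option is to pass standard JDK unified-logging flags to Flink's task manager JVMs via the `env.java.opts.taskmanager` configuration key. The snippet below is a hedged sketch that only assembles the option string; the exact flags depend on your JDK version (these assume JDK 11+), and the log file path is an example.

```python
# JDK 11+ unified logging selector for GC events, plus JIT compilation tracing.
gc_logging = "-Xlog:gc*:file=/var/log/flink/tm-gc.log:time,uptime,level,tags"
jit_logging = "-XX:+PrintCompilation"

# Value for Flink's env.java.opts.taskmanager (set in flink-conf.yaml or your deployment spec).
java_opts = f"{gc_logging} {jit_logging}"
print(f"env.java.opts.taskmanager: {java_opts}")
```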
Jiří Holuša
Table API on Confluent Cloud: Show me Examples!
Despite being one of Apache Flink's core APIs, the Table API remains a niche choice, especially when compared to Flink SQL. This session aims to highlight the underrated capabilities of the Table API for developing and managing a fleet of streaming pipelines. Crafted with passion by a developer for developers, the talk will be packed with practical examples that demonstrate how to get the streaming job done.
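As a flavor of what such examples can look like, here is a minimal PyFlink Table API sketch; the session itself may use the Java API and Confluent Cloud specifics, and the table, connector, and column names here are purely illustrative.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Illustrative source; on Confluent Cloud this would be a Kafka-backed table instead.
t_env.execute_sql("""
    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# The same pipeline you could write in SQL, expressed programmatically with the Table API.
views_per_user = (
    t_env.from_path("page_views")
         .filter(col("user_id") > 0)
         .group_by(col("user_id"))
         .select(col("user_id"), col("url").count.alias("view_count"))
)
views_per_user.execute().print()
```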
Timo Walther
StreamLink: Real-Time Data Ingestion at OpenAI Scale
In the modern data lakehouse, real-time ingestion isn’t just a nice-to-have – it’s a foundational capability. Model training and evaluation, human analysts, and autonomous AI agents all demand fresh, trustworthy data from diverse sources at massive scale. These expectations are a challenge for platform teams – but they’re also an opportunity to unlock massive business value.

At OpenAI, we built StreamLink, a real-time streaming ingestion platform for the data lakehouse, powered by Apache Flink. StreamLink ingests 100+ GiB/s of data from Kafka into Delta Lake and Iceberg tables, supporting 2,000 datasets across 20+ partner teams.

In this session, we’ll dive deep into the design and implementation of StreamLink. We’ll explore our Kubernetes‑native deployment model (Flink K8s Operator), adaptive autoscaling heuristics, and self‑service onboarding model – all of which keep platform operations lean. Attendees will take away concrete patterns for building scalable, manageable real-time ingestion systems in their own data lakehouse.
Adam Richardson
Powering Real-Time Vehicle Intelligence at Rivian with Apache Flink
At Rivian, our mission is to design and build vehicles that inspire and enable sustainable exploration while delivering a seamless, intelligent user experience. Our connected fleet streams real-time telemetry, including sensor data such as location and battery state of charge (SOC). To turn this firehose of raw information into instant driver alerts and years of searchable insight, we rely on Apache Flink and Kafka. In this talk, we’ll show how we built a scalable, cloud-native stack that powers real-time features and long-term intelligence.

Our vehicles generate a continuous stream of telemetry data. To handle this firehose of information, we’ve built a robust stream processing architecture centered around Flink ingestion pipelines. These pipelines process and enrich the data in real time, powering both internal analytics and external customer experiences.

One of the standout components of our platform is Event Watch, a Flink-powered feature that allows teams and customers to define streaming jobs that detect key events such as abnormal battery drain, collisions, or vehicle movement into or out of a geofence. These events trigger mobile push notifications instantly, enabling proactive maintenance, safety features, and personalized alerts.

Beyond real-time event detection, we’ve designed our system for both low-latency responsiveness and long-term analytical depth. Processed telemetry is stored in Databricks Delta tables for scalable historical analysis, while a time series database supports fast, live queries for dashboards and monitoring systems.

We’ll walk through how we’ve architected this dual-purpose system, balancing high-throughput stream processing with the flexibility to drill down into historical trends. We’ll also cover how Flink’s stateful processing model enables complex event patterns and reliable delivery, even at scale. Join us to learn how Rivian is building the future of connected vehicles, one event stream at a time.
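As a simplified illustration of the kind of detection Event Watch performs (a hypothetical sketch, not Rivian's implementation), the function below flags a geofence entry or exit by comparing consecutive telemetry points against a circular geofence using the haversine distance.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geofence_event(prev, curr, center, radius_km):
    """Return 'ENTER', 'EXIT', or None by comparing two consecutive positions."""
    was_inside = haversine_km(prev[0], prev[1], center[0], center[1]) <= radius_km
    is_inside = haversine_km(curr[0], curr[1], center[0], center[1]) <= radius_km
    if is_inside and not was_inside:
        return "ENTER"
    if was_inside and not is_inside:
        return "EXIT"
    return None

# Example: a vehicle leaving a 1 km geofence around a home location.
home = (37.7749, -122.4194)
print(geofence_event((37.7750, -122.4195), (37.7900, -122.4000), home, radius_km=1.0))  # EXIT
```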
Rupesh More / Guruguha Marur Sreenivasa
Stream All the Things — Patterns of Effective Data Stream Processing
Data streaming is a really difficult problem. Despite 10+ years of attempting to simplify it, teams building real-time data pipelines can spend up to 80% of their time optimizing it or fixing downstream output by handling bad data at the lake. All we want is a service that will be reliable, handle all kinds of data, connect with all kinds of systems, be easy to manage, and scale up and down as our systems change. Oh, it should also have super low latency and result in good data. Is it too much to ask?

In this presentation, we’ll discuss the basic challenges of data streaming and introduce a few design and architecture patterns, such as DLQ, used to tackle these challenges. We will then explore how to implement these patterns using Apache Flink and discuss the challenges that real-time AI applications bring to our infra. Difficult problems are difficult, and we offer no silver bullets. Still, we will share pragmatic solutions that have helped many organizations build fast, scalable, and manageable data streaming pipelines.
Adi Polak
Unlocking the Mysteries of Apache Flink
Apache Flink has grown to be a large, complex piece of software that does one thing extremely well: it supports a wide range of stream processing applications with difficult-to-satisfy demands for scalability, high performance, and fault tolerance, all while managing large amounts of application state.

Flink owes its success to its adherence to some well-chosen design principles. But many software developers have never worked with a framework organized this way, and struggle to adapt their application ideas to the constraints imposed by Flink's architecture. After helping thousands of developers get started with Flink, I've seen that once you learn to appreciate why Flink's APIs are organized the way they are, it becomes easier to relax and accept what its developers have intended, and to organize your applications accordingly. The key to demystifying Apache Flink is to understand how the combination of stream processing plus application state has influenced its design and APIs. A framework that cares only about batch processing would be much simpler than Flink, and the same would be true for a stream processing framework without support for state.

In this talk I will explain how Flink's managed state is organized in its state backends, and how this relates to the programming model exposed by its APIs. We'll look at checkpointing: how it works, the correctness guarantees that Flink offers, and what happens during recovery and rescaling. We'll also look at watermarking, which is a major source of complexity and confusion for new Flink developers. Watermarking epitomizes the requirement Flink has to manage application state in a way that doesn't explode as those applications run continuously on unbounded streams.

This talk will give you a mental model for understanding Apache Flink. Along the way we'll walk through several examples, and examine how the Flink runtime supports their requirements.
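To preview the watermarking discussion, here is a small, framework-free sketch of the bounded-out-of-orderness idea behind Flink's built-in watermark strategies: the watermark trails the largest event timestamp seen so far by a fixed delay, which is what lets event-time windows eventually close and their state be released. This is a conceptual illustration, not Flink code.

```python
class BoundedOutOfOrdernessWatermark:
    """Tracks the max event timestamp seen and emits a watermark lagging it by a fixed bound."""

    def __init__(self, max_out_of_orderness_ms: int):
        self.bound = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms: int) -> int:
        self.max_ts = max(self.max_ts, event_ts_ms)
        return self.current_watermark()

    def current_watermark(self) -> int:
        # Everything with a timestamp <= watermark is assumed to have arrived.
        return int(self.max_ts - self.bound)

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=5_000)
for ts in [1_000, 4_000, 3_000, 9_000]:   # out-of-order events
    print(f"event ts={ts}, watermark={wm.on_event(ts)}")
```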
David Anderson
Smart Action in Real-time: Building Agentic AI Systems Powered by AWS and Confluent Streaming
Agentic AI systems thrive on the combination of real-time data intelligence and autonomous action capabilities. This session demonstrates how to integrate Confluent's scalable data streaming platform with Amazon Bedrock and SageMaker to build responsive, intelligent systems that can both reason and act. We'll explore architectural patterns for ingesting, processing, and serving data streams at scale with end-to-end governance, highlighting how Confluent's pre-built connectors, in-stream processing, and low-latency inference capabilities effectively contextualize foundation models. We'll examine Agentic AI through the lens of event-driven architecture with well-orchestrated AI microservices for maximum effectiveness. Attendees will learn practical approaches to create GenAI applications with enriched data streams, ensuring accurate and responsive model performance. We'll demonstrate how to optimize agentic workflows by leveraging Bedrock Agents, SageMaker, and MCP Servers. Leave with an architectural blueprint and implementation strategies to help your organization reduce AI infrastructure costs and latency while enabling real-time context awareness, system flexibility, and exceptional customer experience.
Weifan Liang / Braeden Quirante
The Kafka Protocol Deconstructed: A Live-Coded Deep Dive
Kafka powers the real-time data infrastructure of countless organizations, but how many of us really understand the magic behind its speed and reliability? What makes a Kafka broker capable of handling millions of events per second while ensuring durability, ordering, and scalability? And why do features like idempotent producers, log compaction, and consumer group rebalance work the way they do?

In this deep-dive live-coding session, we’ll dissect Kafka down to its essence and rebuild a minimal, but fully functional, broker from scratch. Starting with a raw TCP socket, we’ll implement:
- Kafka’s Binary Wire Protocol: decode Fetch and Produce requests, frame by frame
- Log-Structured Storage: the secret behind Kafka’s append-only performance (sketched below)
- Batching & Compression: how Kafka turns thousands of messages into one efficient disk write
- Consumer Coordination: group rebalances, offset tracking, and the challenges of "who reads what?"
- Replication & Fault Tolerance: why ISR (In-Sync Replicas) is needed for high availability
- Idempotence & Exactly-Once Semantics: the hidden complexity behind "no duplicates"

Along the way, we’ll expose Kafka’s design superpowers and its tradeoffs, while contrasting our minimal implementation with the real Kafka’s added layers (KRaft, SASL, quotas, etc.). By the end, you won’t just use Kafka, you’ll understand it. Whether you’re debugging a production issue, tuning performance, or just curious about distributed systems, this session will change how you see Kafka.

Key Takeaways:
- How Kafka’s protocol works
- The role of log-structured storage in real-time systems
- Why replication and consumer coordination are harder than they look
- Where the real Kafka adds complexity

No prior Kafka internals knowledge needed, just a love for distributed systems and live coding.
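As a taste of the log-structured storage step, here is a tiny, hypothetical Python sketch of the core idea the session rebuilds: an append-only segment file plus an in-memory offset index, so a read is just a seek and a length-prefixed scan. It is deliberately simplified (no segment rolling, no replication, no binary protocol).

```python
class AppendOnlyLog:
    """Minimal log-structured store: append records, look them up by offset."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}           # logical offset -> byte position in the file
        self.next_offset = 0
        open(path, "ab").close()  # create the segment file if it doesn't exist

    def append(self, value: bytes) -> int:
        with open(self.path, "ab") as f:
            position = f.tell()
            f.write(len(value).to_bytes(4, "big") + value)  # length-prefixed record
        offset = self.next_offset
        self.index[offset] = position
        self.next_offset += 1
        return offset

    def read(self, offset: int) -> bytes:
        with open(self.path, "rb") as f:
            f.seek(self.index[offset])
            length = int.from_bytes(f.read(4), "big")
            return f.read(length)

log = AppendOnlyLog("/tmp/demo-partition-0.log")
o = log.append(b'{"event": "order_created"}')
print(o, log.read(o))
```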
Mateo Rojas
From Cockpit to Kafka: Streaming Design Lessons from Aviation
In aviation, real-time data keeps flights safe, aircraft moving, and operations running smoothly. In this talk, we’ll explore how Kafka-based streaming powers aviation - from orchestrating fast aircraft turnarounds on the ground, to monitoring flight performance in the air, and enabling instant decisions by crew and ground teams through connected operational systems.

Drawing on real-world experience building an airline-scale streaming platform, I’ll share practical lessons for platform engineers, including:
- Designing for failure, not perfection - making failures predictable, contained, and recoverable through idempotence, DLQs, and retry strategies (sketched below)
- Managing transformations at scale - ksqlDB patterns and lessons learned handling complex XML payloads
- Isolating workloads through tenant separation - providing streaming corridors for data, compute, and fault containment
- Enforcing data contracts - managing schema evolution across disparate aviation operational systems
- Keeping it simple in complex environments - building boring, understandable, and debuggable pipelines

You’ll leave with practical patterns and mental models for building Kafka-based streaming platforms that are resilient, trusted, and able to operate at scale in a safety-critical industry.
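The first bullet above mentions DLQs and retries; here is a minimal, generic sketch of that pattern (not the speaker's platform): retry a failed record a bounded number of times, then park it on a dead-letter topic so the partition keeps flowing. Topic names, group ID, and the failure condition are illustrative.

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "turnaround-events",
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["aircraft-turnaround"])

MAX_ATTEMPTS = 3

def process(payload: bytes) -> None:
    """Placeholder for real processing; raise to simulate a poison message."""
    if b"malformed" in payload:
        raise ValueError("cannot parse payload")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(msg.value())
            break
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                # Park the poison message on a dead-letter topic instead of blocking the partition.
                producer.produce("aircraft-turnaround.dlq", key=msg.key(), value=msg.value(),
                                 headers=[("error", str(exc).encode())])
                producer.flush()
    consumer.commit(message=msg)  # commit only after success or after routing to the DLQ
```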
Simon Aubury
Bringing Stories to Life With AI, Data Streaming and Generative Agents
Storytelling has always been a way to connect and imagine new worlds. Now, with Generative Agents - AI-powered characters that can think, act, and adapt - we can take storytelling to a whole new level. But what if these agents could change and grow in real time, driven by live data streams?

Inspired by the Stanford paper "Generative Agents: Interactive Simulacra of Human Behavior", this session explores how to build dynamic, AI-driven worlds using Apache Kafka, Apache Flink, and Apache Iceberg. We'll use a Large Language Model to power conversation and agent decision-making, integrate Retrieval-Augmented Generation (RAG) for memory storage and retrieval, and use JavaScript to tie it all together. Along the way, we’ll examine different approaches for data processing, storage, and analysis.

By the end, you’ll see how data streaming and AI can work together to create lively, evolving virtual communities. Whether you’re into gaming, simulations, research or just exploring what’s possible, this session will give you ideas for building something amazing.
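To make the RAG-for-memory idea concrete (the session itself ties things together in JavaScript; this is a language-agnostic illustration in Python with made-up embeddings), an agent's memory store can be as simple as vectors plus cosine similarity: store each memory with its embedding, retrieve the most similar memories to the current situation, and feed them to the LLM as context.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# In a real system these embeddings would come from an embedding model,
# and memories would be appended from a Kafka stream of agent observations.
memory_store = [
    ("Met Elena at the tavern and discussed the harvest festival", [0.9, 0.1, 0.3]),
    ("Repaired the windmill with Tomas",                           [0.2, 0.8, 0.1]),
    ("Promised to bring bread to the festival",                    [0.8, 0.2, 0.4]),
]

def retrieve(query_embedding: list[float], top_k: int = 2) -> list[str]:
    ranked = sorted(memory_store, key=lambda m: cosine(query_embedding, m[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

# "What should I do about the festival?" -> embed the query, pull the most relevant memories.
print(retrieve([0.85, 0.15, 0.35]))
```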
Olena Kutsenko
What Can You Do with a (Kafka) Queue?
The “traditional” consumer group coordination in Apache Kafka assigns each partition of a topic to a member of a consumer group, providing a powerful combination of ordering and scalability. Sometimes ordering is not of the essence, and we would rather treat events as individual units of work. Enter KIP-932 - aka “Queues for Kafka” - enabling multiple consumers in the group to process from the same topic-partition. Every time I have spoken on this topic, I get the same questions: What are some use cases and when do I use it? So let’s take some time to identify and explore a couple of use cases. We’ll walk through code samples for these scenarios and how we can validate the behavior. When we wrap up, you’ll have a better idea of how and where to get started using queues the Kafka way.
Sandon Jacobs
Real-Time Data Infrastructure at Scale: Lessons from Meta's Streaming Architecture
I'll deliver a concentrated dose of hard-won lessons from building and operating one of the world's largest real-time data processing systems. In just 15 minutes, this talk will share the most critical insights from processing hundreds of terabytes daily across billions of users, focusing on the breakthrough moments and hard-learned principles that transformed our streaming architecture. This isn't a broad survey—it's a focused deep dive into the three most impactful challenges we've solved at Meta's scale, delivered with the intensity and practical focus that comes from years of production experience. Attendees will leave with immediately actionable strategies that can be applied regardless of their current scale.

We'll start by establishing the unique constraints of processing streaming data at Meta's scale - where traditional solutions break down and custom approaches become necessary. I'll share specific numbers around throughput, latency requirements, and the complexity of coordinating thousands of internal teams sharing the same infrastructure.

Lightning Architecture Overview: a rapid but comprehensive walkthrough of Meta's Scribe-Puma-Ptail streaming stack, focusing on the key architectural decisions that enabled massive scale rather than implementation details.

The Three Critical Breakthroughs: I'll focus on the three most transformative solutions we've implemented:
- Schema Evolution Without Downtime: how we handle thousands of evolving event schemas across product teams while maintaining backward compatibility
- Multi-Tenant Resource Isolation: our approach to preventing noisy neighbor problems when thousands of teams share the same streaming infrastructure
- Cross-DC Failure Recovery: battle-tested strategies for maintaining consistency during regional outages

Each section will include a specific production war story, the technical solution we implemented, and the key principles other teams can apply.
Vivek Chittireddy
Beyond Message Key Parallelism: Increasing Dropbox Dash AI Ingestion Throughput by 100x
The core theme of this talk is building upon existing parallel-consumer work that allows for message-key-level parallelism while retaining ordering guarantees without provisioning additional partitions. We extend the principle by decomposing messages into smaller sub-messages, allowing messages with the same key to be processed simultaneously while still retaining ordering guarantees. The sub-message parallel consumer allows for faster time to market, at lower latency and cost versus existing methods presented in the literature.

This talk walks through a very real scenario we experienced when scaling up Dropbox Dash (AI assistant): the deadline to onboard a customer is tomorrow morning, but the backlog needs two weeks to finish processing due to a poor key choice in the early stages of development, which led to every message ending up on the same partition.

I will recap existing topics to set the context:
1. Conventional Kafka parallelism (partition level)
2. Message-key-level parallelism using techniques discussed for the Confluent Parallel Consumer

I will also present the additional constraints that we face in our own system:
1. It is not feasible to change the producer quickly, because other consumers depend on the event stream
2. Long chains of messages with the same key render key-based message-level parallelism ineffective
3. The extra latency and monetary cost of consuming, breaking down messages, producing, and consuming again is not desirable

I will present the novel method we adopted to clear the backlog with a ~100x throughput gain and onboard the customer on time: the sub-message parallel consumer, the constraints it functions under, and the intuition for the proof of why it works. I will provide some benchmarks around the performance, and close out the talk with Q&A.

Key takeaways:
1. Kafka messages can be parallelized beyond whole messages
2. Clever processing on the consumer side can result in lower latency and costs versus breaking down messages and re-producing them, while not affecting other consumers on the same topic the way a producer-side change would
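As a conceptual illustration of the idea described above (a hypothetical sketch, not Dropbox's implementation), one message is split into independent sub-messages, the sub-messages are processed concurrently, and the results are reassembled in their original order before the message would be acknowledged, so ordering as observed downstream is preserved.

```python
from concurrent.futures import ThreadPoolExecutor

def split(message: dict) -> list[dict]:
    """Decompose one large message into independently processable sub-messages."""
    return [{"doc_id": message["doc_id"], "seq": i, "chunk": c}
            for i, c in enumerate(message["chunks"])]

def process_chunk(sub: dict) -> dict:
    # Placeholder for the expensive per-chunk work (e.g., embedding or indexing).
    return {**sub, "result": sub["chunk"].upper()}

def handle_message(message: dict, workers: int = 8) -> list[dict]:
    subs = split(message)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, subs))   # map() preserves input order
    # Only after every sub-message finishes (in order) would the Kafka offset be committed.
    return results

msg = {"doc_id": "dash-42", "chunks": ["intro", "body", "appendix"]}
for r in handle_message(msg):
    print(r["seq"], r["result"])
```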
David Yun