Definition

What is Apache Flink?

Apache Flink is a distributed data processing platform for use in big data applications, primarily involving analysis of data stored in Hadoop clusters. Supporting a combination of in-memory and disk-based processing, Flink handles both batch and stream processing jobs, with data streaming the default implementation and batch jobs running as special-case versions of streaming applications.
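The idea that a batch job is a special case of a streaming job can be illustrated without Flink itself. The following is a minimal sketch in plain Python, not Flink's API; the function names `running_word_count` and `batch_word_count` are hypothetical and chosen only for illustration. The streaming function emits an updated result after every record, and the batch function simply runs the same logic over a bounded input and keeps the final result.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def running_word_count(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Streaming view: emit a snapshot of the word counts after each line."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
        yield dict(counts)


def batch_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Batch view: run the streaming logic over a bounded input and keep
    only the last snapshot, which covers the whole data set."""
    result: Dict[str, int] = {}
    for snapshot in running_word_count(lines):
        result = snapshot
    return result


if __name__ == "__main__":
    bounded_input = ["flink handles streams", "flink handles batches"]
    # Streaming: one intermediate result per incoming line.
    for snapshot in running_word_count(bounded_input):
        print(snapshot)
    # Batch: a single final result for the finite input.
    print(batch_word_count(bounded_input))
```

The only difference between the two modes in this sketch is whether the input ends. An unbounded source would keep producing snapshots indefinitely, while a bounded one yields a final answer.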

Flink was designed as an alternative to MapReduce, the batch-only processing engine that was paired with the Hadoop Distributed File System (HDFS) in Hadoop's initial incarnation. The Flink software is open source and adheres to The Apache Software Foundation's licensing provisions. Its development was initially driven primarily by data Artisans GmbH, a Berlin-based startup that has since been renamed Ververica.

How does Apache Flink work?

Flink streaming applications are programmed via a DataStream API using either Java or Scala. These languages, as well as Python, can also be used to program against a complementary DataSet API for processing static data. Flink can run locally in a single Java virtual machine (JVM), as a standalone cluster, on YARN-based Hadoop clusters or on cloud systems.

The core Flink runtime supports a pipelined streaming architecture; it also offers a built-in method to support iterative data processing for machine learning and other analytics applications. Dedicated APIs and libraries are provided for development of machine learning programs, as well as complex event processing, relational queries, graph processing and other uses. Another API is focused on Hadoop application integration.
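A pipelined architecture means that each record moves through the chain of operators as soon as it arrives, rather than waiting for an earlier stage to finish processing the whole data set. The sketch below imitates that behavior with plain Python generators. It is a single-threaded analogy under simplifying assumptions, not Flink's runtime or API, and the operator names (`source`, `square`, `keep_even`, `run_pipeline`) are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def source(values: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Emit records one at a time, logging each emission."""
    for v in values:
        trace.append(f"source {v}")
        yield v


def square(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Map operator: square each record as soon as it arrives."""
    for r in records:
        trace.append(f"square {r}")
        yield r * r


def keep_even(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Filter operator: pass through only even records."""
    for r in records:
        trace.append(f"filter {r}")
        if r % 2 == 0:
            yield r


def run_pipeline(values: Iterable[int]) -> Tuple[List[int], List[str]]:
    """Chain the operators; return the results and the order of operator calls."""
    trace: List[str] = []
    results = list(keep_even(square(source(values, trace), trace), trace))
    return results, trace


if __name__ == "__main__":
    results, trace = run_pipeline([1, 2, 3])
    print(results)
    for step in trace:
        print(step)
```

In the trace, the first record passes through all three operators before the second record leaves the source, which is the interleaving that distinguishes pipelined execution from stage-by-stage batch execution. A real distributed engine adds parallel operator instances, network buffers and fault tolerance, none of which is modeled here.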

How has Apache Flink evolved?

Flink arose as an offshoot of Stratosphere, a project begun in 2009 at three universities in Germany: TU Berlin, Humboldt University of Berlin and the Hasso Plattner Institute. The Flink technology became an Apache Incubator project in April 2014 and a top-level project in December of that year; after nine earlier releases, Apache Flink 1.0.0 was released in March 2016. With that release, Flink officially joined other Hadoop ecosystem frameworks such as Spark, Storm and Samza in the competition to provide big data streaming capabilities.

This was last updated in October 2021
