(You can find part 2 here)
At the end of 2016, data Artisans organized the first-ever Apache Flink® user survey in order to better understand Flink usage in the community, asking for feedback about both common patterns and the most-needed Flink features.
The results are in, and we’ll be sharing them in a two-post series. This first post will include a summary of answers to the survey’s multiple-choice questions, and the second post will include written answers to open-ended questions that respondents gave us permission to share anonymously.
For context, here’s some general information about the survey:
- We collected responses between 18 Nov 2016 and 13 Dec 2016
- The survey was distributed via the Apache Flink mailing lists, the data Artisans Twitter account, and Apache Flink meetup groups around the world
- In total, 119 respondents from 21 different countries answered at least 1 question; note that each graph includes a count of respondents for that particular question
If you’d like to download a single file with all 5 of the graph images from this post, you can do so here.
First, a fun one: where in the world are Flink users? The Flink community has long been a global one: 27% of respondents are based in the United States, with many more throughout continental Europe, South America, and Asia.
Flink Usage
Next, we’ll look at a few basic Flink usage metrics.
- Just over ⅓ of respondents either have or had a Flink application running in production
- An overwhelming majority (91%) use Flink’s DataStream API, while just over half (55%) use the DataSet API as well
- Java is the most popular language for developing in Flink (77%), and more than half (57%) use Scala
- And more than half of respondents (52%) use at least one of Flink’s libraries
Flink Satisfaction and Evaluation Criteria
We asked users to share overall Flink satisfaction as well as satisfaction with different components of Flink by selecting one of: Completely Satisfied, Very Satisfied, Moderately Satisfied, Slightly Satisfied, or Not At All Satisfied.
- Overall satisfaction: 70% of Flink users are either Completely Satisfied or Very Satisfied with Flink
- Component-specific satisfaction: “Throughput and Latency” (89%) and “Event time handling” (85%) had the highest share of respondents who are either Completely Satisfied or Very Satisfied. “Support for SQL and Python” (21%) and “Monitoring & Operations” (19%) are at the bottom of this list.
So it’s not surprising, then, that two performance-related criteria, “Latency” (36%) and “Throughput” (27%), were the attributes rated as most important when evaluating and deciding whether to work with Flink.
Lastly, let’s look at satisfaction from a different angle: the biggest challenges that users encounter when working with Flink. Exactly half of users who ranked “Combining historic and real-time data into a single stream” as one of their Flink challenges put it at the top of their list, and 44% of users who ranked “Management and maintenance of long-running Flink jobs” as a challenge listed it as #1.
Flink Ecosystem
Next, let’s get a sense of how Flink fits into the broader ecosystem.
- When it comes to getting data in and out of Flink, Apache Kafka is the clear leader, with 77% of respondents using Kafka as a source or sink. Next on the list is HDFS, in use by 57% of respondents.
- What else did users evaluate as alternatives when choosing a stream processor? Spark Streaming led the way (86%), a logical result given the popularity of Apache Spark as a batch processor, followed by Apache Storm (53%), the first widely known open-source distributed stream processor.
- As of late 2016, on-premise deployments using YARN (45%) and standalone mode (41%) were most popular among respondents, but it’s worth noting that the resource manager space is evolving quickly, and Flink 1.2 will introduce improved support for Mesos and other deployment models.
- And Cloudera (32%) and Hortonworks (28%) were the two most commonly used commercial software distributions, with no company holding a clear majority.
Flink User Profile
Lastly, here are some of the characteristics of Flink users who responded to the survey.
- A majority of respondents identified their role as “Engineering / Application Development” (54%), with “Data / Systems Architecture” next on the list at 22%.
- And more than ⅔ of respondents (69%) develop on Unix / Linux, while just over half (51%) develop on a Mac environment (note that it was possible for respondents to submit more than one answer to this question).
- “Software” was by a large margin the most common industry among respondents (51%), followed by “Internet” (29%), “Telecomm” (15%), and “Finance” (10%). The size of respondents’ organizations varies, with the largest share (25%) in a company with more than 100 but fewer than 1000 employees. Over ⅓ of respondents (34%) work in an organization with 1000 or more employees.
Thanks for tuning in. That’s it for part 1, and we’ll publish part 2 next week.