
What is Apache Kafka®? (A Confluent Lightboard by Tim Berglund)


Hi, I’m Tim Berglund with Confluent. I’d like to tell you what Apache Kafka is, but first, I wanna start
with some background. For a long time now, we
have written programs that store information in databases. Now, what databases encourage us to do is to think of the world in terms of things, things like, I don’t know,
users and maybe a thermostat. That’s a thermometer,
but you get the idea. Maybe a physical thing, like a train, let’s see, here’s a train. Things, there are things in the world. Databases encourage us
to think in those terms and those things have some state. We take that state, we
store it in the database. This has worked well for decades, but now some people are finding that it’s better, rather than
thinking of things first, to think of events first. Now events have some state too, right? An event has a description
of what happened with it, but the primary idea is that the event is an indication in time
that the thing took place. Now it’s a little bit cumbersome to store events in databases. Instead, we use a structure called a log and a log is just an ordered
sequence of these events. An event happens and
we write it into a log. A little bit of state, a little bit of description of what happens, and that says, hey, that
event happened at that time. As you can see, logs are
really easy to think about. They’re also easy to build at scale, which historically has not
quite been true of databases, which have been a little cumbersome in one way or another to build at scale. Now Apache Kafka is a system
for managing these logs. Using a fairly standard, historical term, it calls them topics. This is a topic. A topic is just an ordered
collection of events that are stored in a durable way, durable meaning that
they’re written to disk and they’re replicated. So they’re stored on more than one disk, on more than one server, somewhere, wherever that infrastructure runs, so that there’s no one hardware failure that can make that data go away. Topics can store data for
a short period of time, like a few hours, or for days or years or hundreds of years or indefinitely. Topics can also be relatively small or they can be enormous. There’s nothing about
the economics of Kafka that says that topics have to be large in order for it to make sense, and there’s nothing about
the architecture of Kafka that says that they have to stay small. So they can be small, they can be big. They can remember data forever. They can remember data
just for a little while, but they’re a persistent record of events. Each one of those events
represents a thing happening in the business. Like, remember our user,
maybe a user updates her shipping address or
a train unloads cargo or a thermostat reports
that the temperature has gone from comfy to
“is it getting hot in here?” Each one of those things can be an event stored in a topic, and
Kafka encourages you to think of events
first and things second. Now, back when databases ruled the world, it was kind of the trend
to build one large program. We’ll just build this
gigantic program here that uses one big database all by itself, and it was customary, for a
number of reasons, to do this. But, these things grew to a point where they were difficult to change and also difficult to think about. They got too big for any one developer to fit that whole program
in his or her head at the same time. And, if you’ve lived like this, you know that that’s true. Now the trend is to write lots and lots of small programs, each one
of which is small enough to fit in your head and think about and version and change
and evolve all on its own. And these things can talk to each other through Kafka topics. So each one of these services can consume a message from a Kafka topic, do whatever its computation
is that goes on there, and then produce that message
off to another Kafka topic that lives over here. So that output is now durably
and maybe even permanently recorded for other
services and other concerns in the system to process. So with all this data
living in these persistent real time streams, and
I’ve drawn two of them now, but imagine there are dozens or hundreds more in a large system. Now it’s possible to build new services that perform real-time
analysis of that data. So I can stand up some
other service over here that does some kind of gauge, some sort of real-time analytics dashboard and that is just consuming messages from this topic here. That’s in contrast to
the way it used to be where you ran a batch process overnight. It’s possible that yesterday is a long time ago for
some businesses now. You might want that insight to be instant or as close to instant
as it can possibly be. And, with data in these topics as events that get processed as soon as they happen, it’s now fairly straightforward
to build these services that can do that analysis in real time. So you’ve got events, you’ve got topics, you’ve got all these little services talking to each other through topics. You’ve got real-time analytics. I think if you have those
four things in your head, you’ve got a decent idea of kind of the minimum viable understanding, not only of what Kafka is, which is this distributed log thing, but also of the kinds of
software architectures that Kafka tends to give rise to. When people start building systems on it, this is what happens. Once a company starts using Kafka, it tends to have this viral effect. Right, we’ve got these
persistent, distributed logs that are records of the
things that have happened. We’ve got things talking through them, but there are other systems. I mean, what’s this,
there’s this database. There’s probably gonna be, you know, another database out there that was built before Kafka came along and you wanna integrate these systems. There could be other systems entirely. Maybe there’s a search cluster. Maybe you use some SaaS product to help your salespeople organize their efforts, all these systems in the business, and their data isn’t in Kafka. Well, Kafka Connect is a tool that helps get that data in and back out. When there’s all these
other systems in the world, you wanna collect data, so
changes happen in the database, and you wanna collect that data and get it written into a topic like that. And now, I can stand up some new service that consumes that data and does whatever computation is on it now that it’s in a Kafka topic. That’s the whole point. Connect gets that data in. Then, that service produces some result which goes to a new topic over here and Connect is the piece that moves it to whatever that external
legacy system is here. So, Kafka Connect is this process that does this inputting
and this outputting and it’s also an ecosystem of connectors. There are dozens, hundreds of connectors out there in the world. Some of them are open source. Some of them are commercial. Some of them are in between, but there are these
little pluggable modules that you can deploy to
get this integration done in a declarative way. You deploy them, you configure them. You don’t write code to do this reading from the database, this writing to whatever that external system is. Those modules already exist. The code’s already written. You just deploy them and Connect does that integration to
those external systems. And let’s think about the work that these things do. These services, these
little boxes I’m drawing, they have some life of their own. They’re programs, right, but they’re gonna process messages from topics, and they’re gonna have some computation that they wanna do over those messages. And it’s amazing, there’s really just a few things that people end up doing, like, say you have messages,
these green messages, you wanna group all those
up and add some field. Like, come up with the total
weight of all the train cars that passed a certain point or something, but only a certain kind of car, only the green kinds of cars. And you’ve got these other, say you’ve got these orange ones here. So, right away, we see
that we’re gonna have to go through those messages. We’re gonna have to group by some key and then we’ll take the group and run some aggregation over it or maybe count them or
something like that. Maybe you want to filter,
maybe I’ve got this topic and, let’s see, make some room for some other topic over here that’s got some other kinda data, and I wanna take all the messages here
and somehow link them with messages in this topic, and enrich, when I see
this message happen here, I wanna go enrich it with the data that’s in this other topic. These are common things. If it’s the first time
you’ve thought about it, that might seem unusual, but those things, grouping, aggregating,
filtering, enrichment… Enrichment, by the way,
goes by another name in database-land, that’s a join, right? These are the things that
these services are going to do. They’re simple in principle to think about and to sketch, but to
actually write the code to make all that happen takes some work, and that’s not work you wanna do. So, Kafka again, in the box, just like it has Connect
for doing data integration, it has an API called Kafka Streams. That’s a Java API that handles all of the framework and infrastructure and kinda undifferentiated stuff you’d have to build to get that work done. So you can use that as a
Java API in your services and get all that done in a scalable and fault-tolerant way just like we expect for modern
applications to be able to do. And that’s not framework
code you have to write. You just get to use it
because you’re using Kafka. Now if I wanted to do
some sort of analysis on the data in this topic and I didn’t wanna put it in the service. I didn’t want to stand up a
Java application to run it. Confluent has a language called KSQL, in which I can now write some
very SQL-looking thing here. Select from this topic,
you know, group by etc, I won’t write the whole query out, but it’s a SQL-like
language that I can run to be able to do real-time
analysis of data in a topic. Where do those results
go, you’re wondering? Well, I think you’ll
be unsurprised to hear that the output of this query just goes into another topic which I can then do further analysis on, stand up other services to process, use Kafka Connect to
send to a legacy system. It all kinda fits into the platform. And when I say platform,
Confluent Platform is a distribution of Apache Kafka that you can use in a number of ways. Number one, there are open
source and community license components of Confluent
Platform that are free to use. There’s a lot of these connectors. There’s KSQL, components like that. You can just download and use for free. There are other components
of Confluent Platform that come with an Enterprise subscription, things like multi-data center support, Enterprise-grade subscription
and things like that. Even those, of course, if you just want to go download it and use it are
free to use on a single node. So nothing is stopping
you from experimenting even with those subscription features, and, if the idea of
downloading, installing and running on your own infrastructure seems old-fashioned to you, you can also use Confluent Cloud. Confluent Cloud is a fully managed, serverless Kafka in the cloud. It lets you get all of
this kinda thing done without thinking about
infrastructure at all. (upbeat music) Now, if you’re a developer
and you wanna learn more, you know the thing to do
is to start writing code. Got a resource for you. It’s called Kafka tutorials. That’s kafka-tutorials.confluent.io. There’s a bunch of executable examples with narrative explaining
what all the code does. They’ll help you learn
the basics of topics and producing and
consuming and Kafka Streams and KSQL, all this kinda stuff. And there are tested running examples ready there for you, with explanation of how to do it all. If you wanna learn more
about Confluent Platform in general, confluent.io
is the place for you. Check those things out. Let us know if you have any questions. And I hope we hear from you soon. (upbeat music)
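
For readers who want something concrete to go with the lightboard, here is a rough sketch of the first ideas in the talk: an event, a topic as an ordered, append-only log with optional retention, and a small service that consumes from one topic and produces to another. This is a toy in-memory model in plain Python, not Kafka's client API. The names `Event`, `Topic`, `run_service`, and `label`, the topic names, and the 26-degree threshold are all made up for illustration, and a real topic is also written to disk and replicated, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable record: what happened, and when."""
    timestamp: float         # seconds; supplied by the caller in this sketch
    key: str                 # e.g. a user id or a thermostat id
    value: Dict[str, Any]    # a little bit of state describing what happened


@dataclass
class Topic:
    """Toy stand-in for a topic: an ordered, append-only list of events."""
    name: str
    retention_seconds: Optional[float] = None   # None means keep forever
    _events: List[Event] = field(default_factory=list)
    _base_offset: int = 0    # offset of the oldest event still retained

    def append(self, event: Event) -> int:
        """Write an event at the end of the log and return its offset."""
        self._events.append(event)
        return self._base_offset + len(self._events) - 1

    def read(self, offset: int = 0) -> List[Event]:
        """Return every retained event at or after the given offset."""
        start = max(offset - self._base_offset, 0)
        return self._events[start:]

    def expire(self, now: float) -> None:
        """Drop events older than the retention window, oldest first."""
        if self.retention_seconds is None:
            return
        while self._events and now - self._events[0].timestamp > self.retention_seconds:
            self._events.pop(0)
            self._base_offset += 1


def run_service(source: Topic, sink: Topic, compute: Callable[[Event], Event]) -> int:
    """Consume each event from source, compute a result, produce it to sink."""
    produced = 0
    for event in source.read():
        sink.append(compute(event))
        produced += 1
    return produced


def label(event: Event) -> Event:
    """Hypothetical computation: tag a reading as comfy or hot."""
    feeling = "hot" if event.value["temp_c"] > 26.0 else "comfy"
    return Event(event.timestamp, event.key, {**event.value, "feeling": feeling})


readings = Topic("thermostat-readings")
labels_topic = Topic("thermostat-labels")
readings.append(Event(1.0, "thermostat-1", {"temp_c": 21.0}))
readings.append(Event(2.0, "thermostat-1", {"temp_c": 29.5}))

run_service(readings, labels_topic, label)
labels = [e.value["feeling"] for e in labels_topic.read()]
```

Because the output lands in a second topic rather than being returned to a caller, any number of other services (a dashboard, say) could read `labels_topic` later without the first service knowing about them, which is the decoupling the talk describes.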

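The handful of operations the talk lists for services (filtering, grouping, aggregating, and enrichment, which is a join) can also be sketched in a few lines. Kafka Streams itself is a Java API, so the plain-Python functions below only illustrate the ideas and are not its API. The train-car tuples, the `owners` lookup, and every function name here are assumptions invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical events: (car_id, color, weight in tonnes).
Car = Tuple[str, str, float]


def filter_by_color(cars: Iterable[Car], color: str) -> List[Car]:
    """Filtering: keep only the events that match a condition."""
    return [car for car in cars if car[1] == color]


def total_weight_by_color(cars: Iterable[Car]) -> Dict[str, float]:
    """Grouping and aggregating: group by a key, then sum each group."""
    totals: Dict[str, float] = defaultdict(float)
    for _car_id, color, weight in cars:
        totals[color] += weight
    return dict(totals)


def enrich(cars: Iterable[Car], owners: Dict[str, str]) -> List[Dict[str, object]]:
    """Enrichment (a join): attach data from a second source by key."""
    return [
        {"car_id": car_id, "color": color, "weight": weight,
         "owner": owners.get(car_id, "unknown")}
        for car_id, color, weight in cars
    ]


cars = [("c1", "green", 30.0), ("c2", "orange", 12.5), ("c3", "green", 45.0)]
# Stand-in for the latest value per key read from some other topic.
owners = {"c1": "Acme Rail", "c3": "Northern Freight"}

green = filter_by_color(cars, "green")
totals = total_weight_by_color(cars)
enriched = enrich(cars, owners)
```

The sketch works on a finished list, whereas a real stream never ends. Keeping the running totals correct, scalable, and fault-tolerant as events keep arriving is the framework work the talk says Kafka Streams takes off your hands.
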
Comments (3)

  1. This is such a great introduction video on Kafka! Thanks for doing this.

  2. Great explanation about different components available in kafka & confluent 👏

  3. The only problem with Confluent Cloud is that the maximum retention period is currently one month, which means for usages which want to persist their events in-topic forever it isn't a viable option.
