2019-01-20

Kafka

Kafka的学习笔记--介绍（总体概念）

前言

最近项目要用到 Kafka，要研究一下这个，先得总体对它有个了解，而个人经验，想要获得总体了解，最好的办法就是阅读维基百科，或者阅读官网的介绍，这次我来阅读官网的介绍。

我觉得，反复研读，充分理解这个介绍，可以提纲挈领，之后再配合对各个模块具体的研究，可以很快理解和掌握。

下面摘抄了正文，配合上总结的一定的段落大意，便于快速抓住要旨。

博客

原帖收藏于IT老兵驿站。

正文

Introduction

Apache Kafka® is a distributed streaming platform. What exactly does that mean?

kafka 是一个分布式的流化平台。怎么理解这个意思呢？

A streaming platform has three key capabilities:

Publish and subscribe to streams of records, similar to a -message queue or enterprise messaging system.

Store streams of records in a fault-tolerant durable way.

Process streams of records as they occur.

一个流化平台需要有三个关键的能力：

对记录流的发布和订阅，类似一个消息队列或者企业级消息系统。
采用一种持久容错的方式来存储记录流。
当记录流产生的时候处理它。

Kafka is generally used for two broad classes of applications:

Building real-time streaming data pipelines that reliably get data between systems or applications

Building real-time streaming applications that transform or react to the streams of data

Kafka通常被用于两种广泛的应用类型：

建立实时流化数据管道，在系统或者应用程序中可靠地获取数据
建立实时流化应用程序，对流化数据来转换或者响应

To understand how Kafka does these things, let’s dive in and explore Kafka’s capabilities from the bottom up.

First a few concepts:

Kafka is run as a cluster on one or more servers that can span multiple datacenters.

The Kafka cluster stores streams of records in categories called topics.

Each record consists of a key, a value, and a timestamp.

一些新的概念：
Kafka 作为一个集群来运行，在一台或者多台服务器上，可以跨越多个数据中心。（后面这句怎么理解？是不是是指每一个服务器上都储存着一份完整的数据）
Kafka 集群存储着流化记录在被称为 topics 的分类中。
每一个记录包含着一个 key，一个 value，和一个 timestamp。

Kafka has four core APIs:

The Producer API allows an application to publish a stream of records to one or more Kafka topics.

The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.

The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

4个核心API：

Producer API 允许一个应用程序去发布一个记录流到一个或多个 Kafka topics 中。
Consumer API 允许一个应用程序去订阅一个或多个 topics，并且处理产生给它们（前面的topics）的记录流。
Streams API 允许一个应用程序去扮演一个stream processor（流化处理器），从一个或多个topics中消费一个输入流，并且产生一个输出流到一个或多个输出的 topics 中，有效转换输入流到输出流。
Connector API 允许建造和运行一个可重用的生产者或者消费者来连接 Kafka topics 到一个existing 的应用程序或者数据系统。例如，一个连接器到一个关系型数据库可以捕获对一个表的每一次变化。

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version. We provide a Java client for Kafka, but clients are available in many languages.

Kafka 协议是用 TCP 协议构建，与语言无关，版本化的，兼容过去老的版本。

Topics and Logs
Let’s first dive into the core abstraction Kafka provides for a stream of records—the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

一个 topic 是一个类别或者 feed name（不知道怎么解释，应该是类似 RSS 的概念，一个可以被订阅的东西的名字），用来发布记录的。Topics 通常是多订户的，就是说，一个 topic 可以有零个，一个，或者很多消费者来订阅写在它里面的数据的。

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

每一个 partition 是一个顺序的，不可变顺序的记录，可以被连续追加的结构化的 commit log（提交日志？）。每条在 partition 中的记录被分配一个 sequential id number，被称为offset，是在 partition 中的每条记录唯一的标识符。

这里留有了几个问题，如何划分 partition，如何对 partition 进行读写？consumer 是针对partition来消费的？

partition 是由 Kafka 来划分的，它来使用 Round-Robin（单循环的方式）来进行写。（奇怪，这句话是哪里找来的，不是原文中所提到的，这个地方是可以，而不是一定采用 Round-Robin，还可以设置一个key，这样去做分配）

The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size so storing data for a long time is not a problem.

无论记录是否被消费，Kafka 集群都会一个可配置的保持周期留存所有发布的记录。

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from “now”.

这段比较关键，事实上，唯一保留在每一个消费者端的元数据就是这个 offset，偏移量，这个偏移量是用户端的，即每个用户读到哪里了。

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.

consumer 是非常便宜的–这里的意思是 consumer 可以很容易的加入或者离开，并没有对其他consumer 产生太多的影响。

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

partitons 有很多用途。可以使日志分布在多台服务器上，这样一个 topic 可以处理更多的数据–这里的意思是一个 topic 可以通过 partition 放在多台 server 上，从而实现 topic 数据的横向扩展？
那每个 partition 又被 ZooKeeper 进行备份，那这里是一个多对多的关系了。
其次，它们扮演者并行单位的角色–在某种程度上，更多要依靠这一点。

Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

这段讲了 sever 和 partition 的关系，这里说 partitions of the log，（日志的partition，这是说Kafka其实是基于日志的？是的，commit log）每一个 server 处理着数据和请求。每一个partition 会在一个可配置数目的 server 中间被复制。（那么一个 partition 会被分在多个server上吗？不会，每个partition在每个server上是完整的）server在交替扮演者 leader（领导者）和follower（跟随者）的角色。leader 为 partition 处理所有读写请求，followers 被动地复制leader。

这里存在了几个问题：

这里的 server 和 cluster 是不是指的是一回事？应该是一个 cluster 包含多个 server。
这里 partition 的一致性维护其实是由 ZooKeeper 提供的吗？ISR 的持久性是通过 ZooKeeper维护的，但是算法又没有使用 ZAB 算法。

Geo-Replication
Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.

Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

Producers 负责在 topic 中选择哪一条记录分配给哪一个 partition。这可以采用 round-robin（循环）方式或者其它方式。

Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

Consumers 用一个 consumer group 的名字来表示自己，每一条发布到一个 topic 的记录被传递到（不是拖的方式吗？这里这么说感觉像是推的方式了，可能方式本身并不重要，也可能是两种方式混合在一起，为了达到目的）每一个订阅的 consumer group 的一个 consumer 实例。Consumer instances可以在单独的进程或者单独的机器上。

这里 consumer instance 和 consumer group 都分别指代什么呢？一个 group 中会有多个instance。consumer group 对应着 partition。

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

上面两句有些含糊，consumer instance 和 consumer group 的关系，和 consumer processes的关系。

A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each “logical subscriber”. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

然而，更为通用的是，我们已经发现 topics 有很小数目的 consumer groups，每一个是“logical subscriber”，逻辑上的订阅者。每一个组由很多 consumer instances 组成，为了可扩展性和容错。除了订阅者是一个 consumer 的 cluster 代替了一个单独的进程之外，这和 publish-subscribe 语法没有任何区别。

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a “fair share” of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka 所实现的消费，是把 log 中的 partition 分散到 consumer instance 上去，以便每一个instance 在任何一个时间点上都是一个排他的，对于 partition 的“fair share”（公平共享）的consumer。维持这个 group 内关系的过程是由Kafka协议动态控制。如果一些新的 instance 加入了 group，它们就会从 group 其它成员那里接管一些 partitions；如果一个 instance 挂掉，它的 partitions 就会分配给剩余的 instance（这里的这个概念，应该就是 rebalance）。

（这说明一个实例对应着一个 partition。）

参考这里：

Having consumers as part of the same consumer group means providing the “competing consumers” pattern with whom the messages from topic partitions are spread across the members of the group. Each consumer receives messages from one or more partitions (“automatically” assigned to it) and the same messages won’t be received by the other consumers (assigned to different partitions). In this way, we can scale the number of the consumers up to the number of the partitions (having one consumer reading only one partition); in this case, a new consumer joining the group will be in an idle state without being assigned to any partition.

Having consumers as part of different consumer groups means providing the “publish/subscribe” pattern where the messages from topic partitions are sent to all the consumers across the different groups. It means that inside the same consumer group, we’ll have the rules explained above, but across different groups, the consumers will receive the same messages. It’s useful when the messages inside a topic are of interest for different applications that will process them in different ways. We want all the interested applications to receive all the same messages from the topic.

一个 group 内的 consumer 互相是竞争的，一个消息被一个 consumer 消费了，group 内的别的 consumer 就不会再收到这个消息了。而 consumer 在不同的 group 中，则每个 group 都会收到这个消息，但是在一个 group 内，还是同样的原则。这样说来，group 对应着 topic。

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Kafka 仅在 partition 中对记录提供总的顺序，而不是在一个 topic 下不同的 partition 之间。对于大多数应用程序而言，按分区排序与按键分区数据的能力相结合就足够了。然后，如果你需要对记录进行总排序，那只能是一个 topic 只有一个 partition，这意味着一个 consumer group只有一个 consumer 进程。

Multi-tenancy
You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation.

多租期，这段还有待理解。

Guarantees
At a high-level Kafka gives the following guarantees:

Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

A consumer instance sees records in the order they are stored in the log.

For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
More details on these guarantees are given in the design section of the documentation.

保证，消息是具有先后性的，最小服务数量，这是说容灾能力，如果复制因子是 N，那么至多可以瘫痪 N - 1 台server。

Kafka as a Messaging System
How does Kafka’s notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren’t multi-subscriber—once one process reads the data it’s gone. Publish-subscribe allows you broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

传统的消息系统有两种模型：队列和发布者-订阅者，前者消息是一对一的，后者消息是一对多的，二者互有优劣。

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka’s model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Kafka 采取了两家之长，一个消息可以支持多订户，使用了 partition，增加了并发。

Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.

这段比较重要，Kafka 允许生产者等待确认，以便在完全复制之前写入不被认为是完整的，并且即使写入的服务器失败也保证持久性–这可能是通过复制到了别的服务器上来实现的。

The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

For details about the Kafka’s commit log storage and replication design, please read this page.

Kafka for Stream Processing
It isn’t enough to just read, write, and store streams of data, the purpose is to enable real-time processing of streams.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

流化处理，把输入流进行处理，转成想要的输出流，这块需要实践一下，才好理解。

Putting the Pieces Together
This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka’s role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise for streaming data pipelines the combination of subscription to real-time events make it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably make it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation.

总结一下，夸一夸Kafka，就不赘述了。

总结

通过阅读这一篇文档，对Kafka有了一个总体的印象。

参考

http://kafka.apache.org/intro