Kafka-aggregator

A Kafka aggregator based on the Faust Python Stream Processing library.

This site documents kafka-aggregator: installation, configuration, user and development guides, and the API reference.

Before installing kafka-aggregator, you might want to run it locally using the docker-compose setup. In that case, jump straight to the Configuration and User guide sections.

Overview

kafka-aggregator uses Faust’s windowing feature to aggregate Kafka streams.

kafka-aggregator implements a Faust agent (stream processor) that adds messages from a source topic into a Faust table. The table is configured as a tumbling window with a size and an expiration time. Every time a window expires, a callback function is called to aggregate the messages allocated to that window. The size of the window controls the frequency of the aggregated stream.
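The tumbling-window idea described above can be illustrated with a minimal, standard-library-only sketch. It does not use Faust and is not the kafka-aggregator implementation: the names window_start and aggregate_windows are hypothetical, timestamps are assumed to be integer seconds, and every window is aggregated at once at the end instead of through a callback when each window expires.

```python
from collections import defaultdict
from statistics import mean


def window_start(timestamp: int, size: int) -> int:
    """Return the start of the tumbling window that contains timestamp."""
    return timestamp - (timestamp % size)


def aggregate_windows(messages, size):
    """Group (timestamp, value) pairs into tumbling windows and average them."""
    windows = defaultdict(list)
    for timestamp, value in messages:
        # Each message belongs to exactly one window (windows do not overlap).
        windows[window_start(timestamp, size)].append(value)
    # One aggregated record per window, ordered by window start.
    return [
        {"window_start": start, "count": len(values), "mean": mean(values)}
        for start, values in sorted(windows.items())
    ]


# Five source messages spread over three 10-second windows.
messages = [(0, 1.0), (5, 3.0), (12, 10.0), (19, 20.0), (21, 5.0)]
aggregated = aggregate_windows(messages, size=10)
```

With a window size of 10 seconds, the five source messages collapse into three aggregated records, which is how the window size sets the frequency of the aggregated stream.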

kafka-aggregator uses faust-avro to add Avro serialization and Schema Registry support to Faust.

Figure 1. Kafka-aggregator architecture diagram showing Kafka and Faust components.

Summary statistics

kafka-aggregator uses the Python statistics module to compute summary statistics for each numerical field in the source topic.

Summary statistics computed by kafka-aggregator.

mean()      Arithmetic mean (“average”) of data.
median()    Median (middle value) of data.
min()       Minimum value of data.
max()       Maximum value of data.
stdev()     Sample standard deviation of data.
q1()        First quartile of the data.
q3()        Third quartile of the data.
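As a rough illustration, the statistics listed above can be computed for one numerical field with the Python statistics module and the built-in min and max. This is a sketch under assumptions, not the kafka-aggregator code: the summarize helper is hypothetical, and the quartiles are taken from statistics.quantiles with its default "exclusive" method, which may differ from how kafka-aggregator defines q1 and q3.

```python
import statistics


def summarize(values):
    """Compute summary statistics for the values of one numerical field."""
    # quantiles(n=4) returns the three cut points [q1, median, q3];
    # it uses the default method="exclusive" (an assumption of this sketch).
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
    }


summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

Note that statistics.stdev and statistics.quantiles both need at least two data points, so a window holding a single message would have to be handled separately in this sketch.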

Scalability

As a Kafka application, kafka-aggregator is easy to scale horizontally: increase the number of partitions for the source topics and run more workers.

To help determine the number of workers needed in a given environment, kafka-aggregator comes with an example module. Using this module, you can initialize a number of source topics in Kafka, control the number of fields in each topic, and produce messages for those topics at a given frequency. It is a good way to start running kafka-aggregator and to understand how it scales in a particular environment.

Installation

Configuration

User guide

Development guide

API

Project information

The GitHub repository for kafka-aggregator is https://github.com/lsst-sqre/kafka-aggregator

See the LICENSE file for licensing information.