Skip to main content

Kafka Clients 101

In this post, we will look at packages to build a real time notification system for GraphSpace. GraphSpace is built in Django. The requirements for our system are
  • Decoupled from the main app ,i.e, should be non-blocking
  • Use Apache Kafka as the message queue system
  • Follow a producer-consumer architecture with broker, Kafka
  • Able to handle different types of notification: Group, Owner and Watching
The above requirements can be satisfied if we have a package that can connect to Kafka, have an ability to handle different types of notification and is non-blocking. To handle different type of notifications we can have producers and consumers send and receive messages with different topics. So the package we need should be able to dynamically do so. Instead of having say 10 producers for 10 topics and establishing 10 connections to Kafka, it is better we have 1 producer which can send messages with 10 different topics. Similarly 1 consumer should be able to subscribe to 10 different topics. By making the calls asynchronous and running in a different thread, we can make our app non-blocking. The following are packages which satisfy at least 1 requirement.

Packages

The following is a list of packages/techniques we can use to develop a real-time notification system for Graphspace. These include a brief description and the merit and demerits of these packages/techniques.

Kafka-python

Python client for the Apache Kafka distributed stream processing system. kafka-python is designed to function much like the official java client, with a sprinkling of pythonic interfaces (e.g., consumer iterators). This is the slowest and most mature client in this list. We can have 1 producer to send messages to different topics and have 1 consumer subscribed to a list of topics. Having 1 consumer listen to multiple topics helps us in using the same message handling code for different type of notification. PS : I love modular code. This being a mature client finding answers to problems is comparitively easy.
Support: Kafka >= 0.9 (backward compatible with v0.8)

PyKafka

A high level and low-level producer-consumer client for Apache Kafka in Python. The major issue with PyKafka is lack of multi-topic support. We have to create different consumers and producers for different topics. This makes it difficult to have dynamically created topics. This will be a requirement in the notification system for GraphSpace as we explore more complicated notifications. This package is faster than the well known kafka-python.
Support: Kafka >= 0.8.2

Confluent-kafka-python

A high-level producer and consumer client for Apache Kafka in Python. This is a lightweight wrapper around librdkafka. Hence its incredibly fast. Although it is the newest package in this list, it has support from Confluent. The consumer can be subscribed to more than one topic and producer can send messages to more than one topic. I like this one.
Support: Kafka 0.9

Performance test by Activision Game Science for the above three tools are as follows

Producer performance

time_in_seconds
MBs/s
Msgs/s
confluent_kafka_producer
5.450890
17.495754
183456.277455
pykafka_producer
57.318527
1.663815
17446.365994
pykafka_producer_rdkafka
15.724413
6.064928
63595.378094
python_kafka_producer
67.855882
1.405441
14737.115900

Consumer performance

time_in_seconds
MBs/s
Msgs/s
confluent_kafka_consumer 3.825439 24.929801 261407.908007
pykafka_consumer 29.431728 3.240293 33976.938217
pykafka_consumer_rdkafka 6.086001 15.669966 164311.503412
python_kafka_consumer 26.547753 3.592298 37667.971237

Aiokafka

An asyncio client for Kafka. This package is based on kafka-python. All the consumer and producer functions are asynchronous. So we do not have to handle multi-threading. Asyncio is a Python 3 package and hence we cannot use it. GraphSpace supports Python 2.7.10. 
Support: Kafka 0.10

Approach

The approach that I have planned is similar to the figure given below, using Apache Kafka instead of RabbitMQ. We will also not be using Tornado server.
The package I have decided to use is confluent-kafka-python (because I love speed). I will be creating a different app in Django for the notification system. The consumer will be running on a parallel thread. I will be writing a decorator as given in this StackOverflow answer. The main application is the producer and the notification system is the consumer.

Comments

  1. Thank you for sharing such valuable information and tips. This can give insights and inspirations for us; very helpful and informative! Would love to see more updates from you in the future.
    Big Data Administrator Training in Chennai
    hadoop big data training in chennai

    ReplyDelete
  2. Excellent blog, thanks for taking time to share this valuable information. Really helpful to me.
    Hadoop Training in Chennai | Best Hadoop Training Institute in Chennai

    ReplyDelete

Post a Comment

Popular posts from this blog

Django + Kafka

Q & A of connecting Apache Kafka and Django
Problem #1 Kafka consumer should always be listening for new messages in the queue. The consumer should be running in parallel to GraphSpace app (producer). Solution There are multiple ways to do this. We create another Django project say called "GraphSpace notification consumer" which starts along the GraphSpace application and establishes a connection with Kafka. This is an overkill for a simple consumer.We can use celery for multi-threading. Celery itself uses Redis or RabbitMQ as a queue for tasks. This setup might work for a large scale multi-threading system but for a simple setup of running a consumer this is a overkill. We will be running 2 queues (Redis and Kafka), one for queuing task and another for queuing content. I tried this setup with the following architecture:
I will recommend keeping the architecture simple, run the consumer code on a different thread in daemon mode. Problem #2 Where to start this consumer th…

Websockets in Django

Q & A

What are websocket? Why do we need it?
Quoting from wikipedia,
"WebSocket is a computer communications protocol, providing full-duplex communication channels over a single TCP connection. The WebSocket protocol was standardized by the IETF as RFC 6455 in 2011, and the WebSocket API in Web IDL is being standardized by the W3C."
WebSockets are used for streaming messages. To implement real-time notification we need to stream messages from the server to the client without a refresh or making an HTTP request from the client. Hence websockets.

How do we implement websockets in django?
We will be using django-channels (channels) for websockets. Django, by default does not support websocket. Channels is a django project which allows Django to handle websockets, HTTP and HTTP2 requests. But how does channels implement websockets in WSGI server (gunicorn) which does not support websockets. Simple we don't. Instead, we will be using Daphne, an interface server designed fo…