Intro to Stream Processing
"Turning the database inside out" is an influential article in the data engineering space, leading to the founding of Kafka. Since then, implementations like Redpanda and Redis Streams emerged, spurring a real-time data processing ecosystem.
Vs event-based programming
Similar to event-based programming, stream processing is a programming paradigm that aims to handle events in near real-time or as soon as events happen. One way to classify between the two might be frequency. Streams are continuous sequence of events with a high throughput: instead of many short-lived connections, you simply keep a connection open and wait for events to come.
Vs batch processing
Stream processing can be thought as batch processing with extremely small batch sizes. The batch size does not necessarily have to be one. Messages can be micro-batched by the millisecond, or second, according to your latency requirement and processing power. But gone are the days of "it may take up to 24 hours for this change to reflect"!
We want to construct the best stream processing platform where Rust's unique characteristics truly shine:
Unlike other languages, Rust's async execution is multi-threaded. It allows you to scale up a process with as many threads as needed to fully utilize the CPU for maximum concurrency. You do not need to setup an external queuing system!
As a language with no garbage collection, there is no random point in time where the garbage collector kicks in and causes jitter. When you have a long pipeline, these jitters tend to propagate and amplify downstream. Rust is not automatically low-latency though - you still need to spend considerable effort in optimization. But you will have a good starting point.
Unlike other languages, the recommended way of packaging Rust programs is to static-link everything into one executable - often only sized a few megabytes. And there is no installation or warm-up needed - it spins up immediately, which is a bonus for stream processing.
Low resource usage
Like other compiled languages, Rust uses considerably less memory than a VM based language. And without the need of JIT, Rust also has less CPU overhead.
Again, without GC, Rust programs are less susceptible to "slow memory bloat over a period of days" (technically, it's not a leak). There is less risk of out-of-memory crashes, so you don't have to "restart the process every week". Albeit, you still have to be careful about heap allocations.
Finally, Rust has a great ecosystem of async programming libraries: networking libraries built on async IO, lock-free channels and other data structures to make async programming ergonomic and fun.
Without further ado, let's get started!