TattleTail is a service that we’ve built to record the events flowing throughout Bronto’s platform. TattleTail records these events to allow any team to analyze our event flow without impacting customer-facing systems.
Events at Bronto
At Bronto, our services use an event-driven architecture to respond to application state changes. For example, our workflow system subscribes to all “Contact Added” events that are produced from the contact service. Our customers can then create “Welcome Series” workflows that send each new contact a series of time-delayed messages.
In reality, we have dozens of event types that a growing number of services generate and subscribe to. Organizing the system around events encourages decoupled services and allows us to quickly add new services as required. This event stream is managed by our internal message broker called Spew, which is a subject for another post.
More Events, More Problems
Event-based architectures, and more generally event-based services, don’t come without a cost. The most obvious risk is the loss of events. This can happen because of a message queue failure or a bug in application code, and it is guaranteed to happen at some point.
Tracing the path from an event source to downstream state changes is difficult. Events can also be received out of order, which can lead to surprising issues with application state.
While adding new services is straightforward (just subscribe to the events you want), we still have to contend with the “cold start” problem when we launch a service. When a service comes online, it generally isn’t useful if it has no data, and no one wants to wait months to accumulate events before they deploy.
One of the simplest ways to deal with the challenges introduced by events is to keep a copy of everything that happens in the system. This isn’t a new idea. Martin Fowler called it Event Sourcing, and the Lambda Architecture uses an immutable record of all events as the primary data store.
Having a record of everything in your application can be great for debugging since events can be replayed through a development environment to identify a bug. It can also make fixing bugs easier by allowing state to be recomputed and repaired after the fact. (This really only works if your application uses immutable data; trying to repair a global counter would not be enjoyable.)
Additionally, an event record helps with the cold start issue. A service can just process old events in batch to jump-start its data store. Existing services can use the record to add features, such as indexing new fields or tracking metrics on different dimensions.
Our first implementation of event sourcing relied on Apache HBase as a durable queue. Events we wanted to record were written to a table, and a nightly MapReduce job wrote the data to the Hadoop File System (HDFS) as SequenceFiles. This solution was adequate, but the output format wasn’t easy to consume, so using it for repairing data wasn’t feasible.
Finally, we arrived at a solution we named TattleTail, which combines our message broker, Spew, and Apache Flume. TattleTail is really quite simple. A custom Flume source (producer) subscribes to Spew events, and an HDFS sink (consumer) efficiently writes batches of events to HDFS.
The most complicated part of logging the events involves partitioning events in HDFS for efficient retrieval. We don’t want to scan the entire repository when we need to find the last 30 days worth of data. Every event in Spew includes a header that describes the message channel it came from, and the HDFS sink from Flume supports using this event metadata to write to different directories. All the Flume source does is extract the headers from the Spew event and copy them to Flume’s event format. The HDFS sink handles the rest.
TattleTail makes several improvements over our previous HBase backed system. First, it receives every event, not just a subset that is difficult to expand. This allows us to get a complete record of events which should make it usable by a wider variety of services.
More importantly, TattleTail uses the same serialization as all the messages flowing through Spew. (Our previous solution used a custom versioned TSV format.) This means that application code that already receives Spew events should be able to consume an HDFS-backed stream with little difficulty.
TattleTail offers several benefits to Bronto. Teams will be able to access production data for debugging, data repair, and ad hoc analysis. From a business perspective, we can search for patterns in customer behavior and find opportunities for new features.
There is still work to be done to realize these benefits. Right now, TattleTail’s data is only accessible via MapReduce jobs. We hope to add Pig support soon, and we are exploring the use of Spark for data analysis. Another area for expansion is the use of TattleTail to keep a copy of our events offsite as part of a disaster recovery mechanism with shorter mean time to recovery.