Bronto Hosts The Iron Yard

Last week, I was privileged to speak to the students of The Iron Yard Academy about my experience in front-end development. Bronto has a long-standing partnership with the expanding code school in the American Underground campus, and each time we host student tours of our space or sit in the audience for their Demo Day event, it’s exciting to see their drive and innovative spirit firsthand.

The demand for the intensive program that The Iron Yard offers is easy to understand. With new tools and frameworks emerging constantly, and a job market that rewards current, in-demand skills, it’s important to have a curriculum and project work that reflect the environment developers actually face.

Things have changed dramatically since I started back in the dot-com years, especially with front-end development. I talked about learning JavaScript during a time when it was considered quite the opposite of the popular language it is today. The resources and tooling were sparse, and it was challenging to build interactive features. Still, since JavaScript is a standardized language, many of the core interfaces, such as MouseEvent, have retained the same basic support over the years. Below is an image drag-and-drop sample pulled from the archives of dynamicdrive.com. With just a few minor updates, I was able to run it in this pen. You can also check out the original version from way back in ’99.

[Animated GIF: the updated drag-and-drop sample running in CodePen]

The biggest difference with front-end development today is choice. Sure, you can still develop powerful features with bare-metal JavaScript, HTML and CSS, but there are so many great solutions out there to aid your workflow. It’s hard to resist adding them to your toolkit, despite the tradeoffs of managing the additional overhead.

At Bronto, we use a lot of Backbone and jQuery. Both are sufficient for implementing sophisticated features, but we also leverage other technologies in our front-end stack:

[Image: the other technologies in our front-end stack]

The real fun began after my presentation. The students had such a broad range of interesting questions regarding business strategy, company philosophy, architectural concerns and future trends. I was blown away by their intelligence and enthusiasm, and I’m even more excited to see their ideas at the next Demo Day!

Improving Automation in Systems Engineering

When I joined Bronto in 2013, I felt we had a reasonably modern procedure for provisioning new systems:

  1. Work with the requesting team to determine the resources needed.
  2. Define the system resources in Foreman and slather on some Puppet classes.
  3. Push the shiny ‘Build’ button and wait.
  4. Tackle all of the fiddly little bits that Foreman wasn’t handling at the time.
  5. Complete peer review and system turnover.

This request might take a business day or two to process, longer if something languished in peer review, exposed some technical debt, or just led to a yak shave. It was a ‘good enough’ solution in an environment where these sorts of requests were infrequent, and we were well aware of the rough edges that needed to be filed off this process when the time was right.

Knowing that our developers were pushing to transition to a more service-oriented architecture and break down the remaining pieces of the old, monolithic code base, we knew it was time to streamline this process before it became a pain point. Requests for new systems were going to be more frequent and more urgent, and we needed to get ahead of the problem by devoting the time to make things better.

After dredging up the relevant improvement requests from our backlog, we tackled the task of filing off those rough edges by:

Writing custom Foreman hooks: These handled the worst of the manual tasks and freed us from the pain of hand-updating Nagios, LDAP, and any number of additional integration points within the infrastructure.
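
Foreman’s hooks plugin will run any executable placed in a directory named for the object and event, passing the event name as an argument and the object’s JSON on stdin. Here’s a minimal sketch of the pattern, with hypothetical Nagios and LDAP helpers standing in for the real integrations (the exact payload shape varies by Foreman version):

```python
#!/usr/bin/env python3
# Sketch of a Foreman hook, e.g. dropped into a directory like
# hooks/host/managed/after_create/. The hooks plugin passes the event name
# as an argument and the object's JSON on stdin.
import json
import sys

def update_nagios(host):
    print(f"would register {host.get('name')} with Nagios")  # hypothetical integration

def update_ldap(host):
    print(f"would create LDAP entries for {host.get('name')}")  # hypothetical integration

def main():
    event = sys.argv[1]                          # e.g. "after_create"
    host = json.load(sys.stdin).get("host", {})  # payload shape varies by version
    if event == "after_create":
        update_nagios(host)
        update_ldap(host)

if __name__ == "__main__":
    main()
```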

Automating the peer review process: Borrowing the idea of test-driven development, we’ve finally reached the point where we have test-driven infrastructure. By writing a set of system tests and launching them from another Foreman hook, we were able to replace manual peer review with automation. Results are then announced in a chat room for cross-team visibility.
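
Here is a rough sketch of that flow as a hook script, with a hypothetical test-runner command and chat webhook standing in for our actual tooling:

```python
#!/usr/bin/env python3
# Sketch: run the system test suite against a freshly built host and
# announce the result in chat. The test runner and webhook are hypothetical.
import json
import subprocess
import sys
import urllib.request

CHAT_WEBHOOK = "https://chat.example.com/hooks/systems"  # hypothetical

def announce(message):
    req = urllib.request.Request(
        CHAT_WEBHOOK,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def main():
    host = json.load(sys.stdin).get("host", {}).get("name", "unknown")
    result = subprocess.call(["run-system-tests", host])  # hypothetical runner
    status = "passed" if result == 0 else "FAILED"
    announce(f"System tests {status} for {host}")
    sys.exit(result)

if __name__ == "__main__":
    main()
```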

Writing an ad hoc API for RackTables: RackTables was a great early solution, but we’re approaching the point where it’s no longer a good fit. Although we’re not quite ready for a new solution to datacenter asset management, being able to programmatically twiddle the information in RackTables was a win.
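
The internal tool isn’t something we can share, but an ad hoc API can be as thin as a small HTTP shim over the application’s database. The endpoint, table, and column names below are hypothetical illustrations, not RackTables’ real schema:

```python
# Sketch of an "ad hoc API": a thin HTTP shim over the asset database so
# other tooling can read and update records programmatically. The table and
# column names here are hypothetical, not RackTables' actual schema.
import pymysql
from flask import Flask, jsonify, request

app = Flask(__name__)

def db():
    return pymysql.connect(host="racktables-db", user="api", password="secret",
                           database="racktables",
                           cursorclass=pymysql.cursors.DictCursor)

@app.route("/assets/<name>", methods=["GET", "PATCH"])
def asset(name):
    conn = db()
    try:
        with conn.cursor() as cur:
            if request.method == "PATCH":
                cur.execute("UPDATE assets SET comment = %s WHERE name = %s",
                            (request.json["comment"], name))
                conn.commit()
            cur.execute("SELECT name, rack, comment FROM assets WHERE name = %s",
                        (name,))
            row = cur.fetchone()
    finally:
        conn.close()
    return (jsonify(row), 200) if row else (jsonify(error="not found"), 404)

if __name__ == "__main__":
    app.run()
```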

Creating Architect to batch-provision virtual machines: Since VMs tend to be requested in groups to create a resilient service, we wrote a tool to automate away the repetitive tasks. Architect gathers the system requirements, creates a configuration file, and then selects suitable hypervisors and builds out each of the systems. This has been a huge win when it comes to fulfilling requests for 10 or more systems at once.
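
Architect itself is internal, but the shape of the idea is roughly this: read a config describing the group, spread the VMs across hypervisors, and drive Foreman’s host-creation API in a loop. The config format, credentials, and least-loaded placement below are all illustrative:

```python
# Sketch in the spirit of Architect: read a config describing a group of
# VMs, spread them across hypervisors, and create each host through
# Foreman's REST API. The config format and placement policy are
# illustrative, not the real tool.
import requests
import yaml

FOREMAN = "https://foreman.example.com/api"
AUTH = ("architect", "secret")  # hypothetical service account

def provision(config_path):
    with open(config_path) as f:
        config = yaml.safe_load(f)  # e.g. role, domain, count, hostgroup_id
    resources = requests.get(f"{FOREMAN}/compute_resources",
                             auth=AUTH).json()["results"]
    load = {r["id"]: 0 for r in resources}  # naive placement count
    for i in range(config["count"]):
        target = min(load, key=load.get)  # least-loaded hypervisor so far
        name = f"{config['role']}{i + 1:02d}.{config['domain']}"
        requests.post(f"{FOREMAN}/hosts", auth=AUTH, json={
            "host": {
                "name": name,
                "compute_resource_id": target,
                "hostgroup_id": config["hostgroup_id"],
                "build": True,  # queue the OS build, like the 'Build' button
            }
        }).raise_for_status()
        load[target] += 1

provision("webservers.yaml")
```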

While there are additional automation improvements we’d like to make, these efforts have allowed us to more rapidly respond to the needs of our development teams and generate systems in minutes instead of days. System provisioning is a core function of our team, and we are always looking for ways to improve our abilities in that area.

TattleTail, the Event Sourcing Service

TattleTail is a service we built to record the events flowing through Bronto’s platform, allowing any team to analyze our event flow without impacting customer-facing systems.

Events at Bronto

At Bronto, our services use an event-driven architecture to respond to application state changes. For example, our workflow system subscribes to all “Contact Added” events that are produced from the contact service. Our customers can then create “Welcome Series” workflows that send each new contact a series of time-delayed messages.
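
Spew’s client API is internal, so this sketch uses a tiny in-process pub/sub as a stand-in, just to make the shape of the pattern concrete: the contact service publishes, and the workflow service reacts.

```python
# Sketch of the pattern, not Bronto's internal API: a tiny in-process
# pub/sub stands in for Spew so the subscription shape is concrete.
from collections import defaultdict

_subscribers = defaultdict(list)

def subscribe(channel):
    """Register the decorated function as a handler for a channel."""
    def register(handler):
        _subscribers[channel].append(handler)
        return handler
    return register

def publish(channel, event):
    """Deliver an event to every subscriber of the channel."""
    for handler in _subscribers[channel]:
        handler(event)

@subscribe("contact.added")
def start_welcome_series(event):
    # A real workflow service would queue the time-delayed messages here.
    print(f"starting welcome series for contact {event['contact_id']}")

# The contact service publishes an event whenever a new contact is created.
publish("contact.added", {"contact_id": 42, "account_id": 7})
```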

In reality, we have dozens of event types that a growing number of services generate and subscribe to. Organizing the system around events encourages decoupled services and allows us to quickly add new services as required. This event stream is managed by our internal message broker called Spew, which is a subject for another post.

More Events, More Problems

Event-based architectures, and event-based services more generally, don’t come without a cost. The most obvious risk is the loss of events. This can happen because of a message queue failure or a bug in application code, and it is guaranteed to happen at some point.

Tracing the path from an event source to downstream state changes is difficult. Events can also be received out of order, which can lead to surprising issues with application state.

While adding new services is straightforward (just subscribe to the events you want), we still have to contend with the “cold start” problem when we launch a service. When a service comes online, it generally isn’t useful if it has no data, and no one wants to wait months to accumulate events before they deploy.

The Solution

One of the simplest ways to deal with the challenges introduced by events is to keep a copy of everything that happens in the system. This isn’t a new idea. Martin Fowler called it Event Sourcing, and the Lambda Architecture uses an immutable record of all events as the primary data store.

Having a record of everything in your application can be great for debugging since events can be replayed through a development environment to identify a bug. It can also make fixing bugs easier by allowing state to be recomputed and repaired after the fact. (This really only works if your application uses immutable data; trying to repair a global counter would not be enjoyable.)

Additionally, an event record helps with the cold start issue. A service can just process old events in batch to jump-start its data store. Existing services can use the record to add features, such as indexing new fields or tracking metrics on different dimensions.
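
The core move is the same in both cases: fold the recorded events into fresh state. Here is a minimal sketch with a made-up contact event log; sorting by timestamp along the way also covers the out-of-order delivery mentioned earlier.

```python
# Sketch: rebuild state by folding recorded events through an apply
# function. The same replay serves data repair and cold-starting a new
# service; sorting by timestamp handles out-of-order arrival.
def replay(events, apply, state=None):
    state = {} if state is None else state
    for event in sorted(events, key=lambda e: e["timestamp"]):
        apply(state, event)
    return state

def apply_contact_event(contacts, event):
    if event["type"] == "contact.added":
        contacts[event["contact_id"]] = {"email": event["email"]}
    elif event["type"] == "contact.removed":
        contacts.pop(event["contact_id"], None)

# A made-up log, delivered out of order: the add arrives after the remove.
log = [
    {"timestamp": 2, "type": "contact.removed", "contact_id": 1},
    {"timestamp": 1, "type": "contact.added", "contact_id": 1,
     "email": "a@example.com"},
]
print(replay(log, apply_contact_event))  # {} -- add, then remove
```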

Our first implementation of event sourcing relied on Apache HBase as a durable queue. Events we wanted to record were written to a table, and a nightly MapReduce job wrote the data to the Hadoop File System (HDFS) as SequenceFiles. This solution was adequate, but the output format wasn’t easy to consume, so using it for repairing data wasn’t feasible.

Enter TattleTail

Finally, we arrived at a solution we named TattleTail, which combines our message broker, Spew, and Apache Flume. TattleTail is really quite simple. A custom Flume source (producer) subscribes to Spew events, and an HDFS sink (consumer) efficiently writes batches of events to HDFS.

The most complicated part of logging the events involves partitioning events in HDFS for efficient retrieval. We don’t want to scan the entire repository when we need to find the last 30 days’ worth of data. Every event in Spew includes a header that describes the message channel it came from, and the HDFS sink from Flume supports using this event metadata to write to different directories. All the Flume source does is extract the headers from the Spew event and copy them to Flume’s event format. The HDFS sink handles the rest.
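
For example, configuring the sink with a path like /events/%{channel}/%Y/%m/%d buckets each event by its channel header and date. Below is a rough Python rendering of that substitution; it simplifies Flume’s actual escape handling, and the header names are just the ones described above.

```python
# Rough Python rendering of what Flume's HDFS sink does with a configured
# path such as /events/%{channel}/%Y/%m/%d: substitute event headers, then
# expand the date escapes from the event's timestamp header (epoch millis).
import re
from datetime import datetime, timezone

def resolve_path(pattern, headers):
    path = re.sub(r"%\{(\w+)\}", lambda m: headers[m.group(1)], pattern)
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000,
                                tz=timezone.utc)
    return ts.strftime(path)

headers = {"channel": "contact.added", "timestamp": "1425888000000"}
print(resolve_path("/events/%{channel}/%Y/%m/%d", headers))
# -> /events/contact.added/2015/03/09
```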

TattleTail makes several improvements over our previous HBase-backed system. First, it receives every event, not just a subset that is difficult to expand. This gives us a complete record of events, which should make it usable by a wider variety of services.

More importantly, TattleTail uses the same serialization as all the messages flowing through Spew. (Our previous solution used a custom versioned TSV format.) This means that application code that already receives Spew events should be able to consume an HDFS-backed stream with little difficulty.

Looking Forward

TattleTail offers several benefits to Bronto. Teams will be able to access production data for debugging, data repair, and ad hoc analysis. From a business perspective, we can search for patterns in customer behavior and find opportunities for new features.

There is still work to be done to realize these benefits. Right now, TattleTail’s data is only accessible via MapReduce jobs. We hope to add Pig support soon, and we are exploring the use of Spark for data analysis. Another area for expansion is the use of TattleTail to keep a copy of our events offsite as part of a disaster recovery mechanism with shorter mean time to recovery.

Working in a Culture of Trust

You’ve probably heard that building a healthy, thriving DevOps culture requires trust. At first glance, the statement seems self-evident. Trust should be a part of any healthy team, DevOps or not. Surely we all trust each other enough to get the job done. Besides, trust is too hard to measure to spend much time worrying about.

Trust is vital. Trust is perhaps the most critical component of a DevOps culture.

DevOps: Year One

I had the pleasure of addressing the Triangle DevOps meetup group on the subject of DevOps: Year One. The target audience was anyone who has bought into the idea of a DevOps transformation for their business but wants practical advice on how to get started.

With only an hour to speak, and so many great questions to answer, we barely got to scratch the surface. But we did get to talk about some specific practices that have helped the Systems Engineering team at Bronto work much more effectively.


Four Rules of Building RPMs

At Bronto, we’re heavily invested in open source technologies. We have dozens of MySQL shards, we run Hadoop and HBase, and we’ve built out a production infrastructure on CentOS.

One of the benefits of open source is that the world keeps improving the software you base your business on. You do need to regularly get those improvements and upgrades into production, though. As a result, we tend to roll our own RPMs fairly often. Here are a few high-level rules we use to (mostly) achieve the zen nirvana of a stable environment running the latest software.