There is a great post from 1994 about black triangles in the context of building large and complex systems. You might think that 1994 sounds like the dark ages, but the principles outlined in that post do stand the test of time. Here is my favorite excerpt from that post:
What she later came to realize (and explain to others) was that the black triangle was a pioneer. It wasn’t just that we’d managed to get a triangle onto the screen. That could be done in about a day. It was the journey the triangle had taken to get up on the screen. It had passed through our new modeling tools, through two different intermediate converter programs, had been loaded up as a complete database, and been rendered through a fairly complex scene hierarchy, fully textured and lit (though there were no lights, so the triangle came out looking black). The black triangle demonstrated that the foundation was finally complete: the core of a fairly complex system was completed, and we were now ready to put it to work doing cool stuff. By the end of the day, we had complete models on the screen, manipulating them with the controllers. Within a week, we had an environment to move the model through.
When I build large systems with lots of moving pieces, I figure out the black triangle analog for that specific problem. I ask myself questions like:
- What is the smallest amount of work I would need to do that validates that the core pieces of the architecture will solve the problem?
- Is there a way to order the work I am doing, so that the most risky and unknown parts are validated first?
- Are there parts of the system that I can omit in the first version, but add very easily later?
All this sounds very abstract, so let's look at a concrete example from one of my recent projects, Sloth.
The core purpose of Sloth is to simulate failure conditions so we can verify large, complex distributed systems. We started with latency as the first failure condition to simulate. That meant the first thing I needed was a way to inject slowness between a back-end system and the clients talking to it.
My company had a home-grown RPC framework that is widely used by all teams. I considered adding support for injecting slowness through code changes in that framework. That would make for a pretty solid MVP, but it would restrict the scope to that one service framework. It wouldn't cover other use cases, like adding latency between an application and a database, or other backends such as RabbitMQ or Kafka that are not microservices.
The first black triangle lesson here is to invalidate early. It is okay to discard approaches that seem reasonable to try but wouldn't let you prove the whole idea. So I didn't spend time changing the RPC framework to support latency injection.
After looking around some more, and talking to other co-workers, I learned about tc. tc (traffic control) is an arcane Linux command that lets you shape network traffic, for example by adding packet loss or latency to packets as they leave a network interface. Pretty much every backend system we used, whether databases, queues (RabbitMQ or Kafka), HTTP services, or the RPC framework, is built on top of TCP for remote communication. It became evident that building something on top of tc would work, and that I would be able to show it working for all sorts of backends.
The first thing I did, before writing any real code, was play around with tc on the command line. I used tc to add latency to various ports on my machine. After that, I wrote some bash scripts around tc that parameterized the port, the network interface, and the latency value. For test targets I used existing services that were easy to spin up, like Memcached or MongoDB: I started them on my machine and then used the shell scripts to add latency.
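The post doesn't show the actual scripts, so here is an illustrative sketch of what one might look like. It uses the well-known prio + netem + u32 filter recipe to delay only traffic destined for one port. Applying rules requires root, so the sketch takes a `DRY_RUN` flag that just prints the tc commands it would run (the `DRY_RUN` convention and the function names are my own, not from the post):

```shell
#!/usr/bin/env bash
# add_latency: delay outbound traffic to a single destination port.
# Usage: add_latency <interface> <port> <delay>, e.g. add_latency lo 11211 100ms
# Applying rules requires root; set DRY_RUN=1 to only print the tc commands.

run_tc() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "tc $*"; else tc "$@"; fi
}

add_latency() {
  local dev="$1" port="$2" delay="$3"
  # 1. A prio qdisc at the root gives us bands to steer traffic into.
  run_tc qdisc add dev "$dev" root handle 1: prio
  # 2. Attach netem, which does the actual delaying, to band 3.
  run_tc qdisc add dev "$dev" parent 1:3 handle 30: netem delay "$delay"
  # 3. A u32 filter matches the destination port and steers those packets into
  #    the netem band; all other traffic flows through the default bands untouched.
  run_tc filter add dev "$dev" protocol ip parent 1:0 prio 3 u32 \
    match ip dport "$port" 0xffff flowid 1:3
}

if [ "$#" -eq 3 ]; then add_latency "$@"; fi
```

Running `DRY_RUN=1 ./add_latency.sh lo 11211 100ms` prints the three tc commands without touching the machine, which is handy for seeing exactly what will be applied.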
With that first set of shell scripts, I could add different amounts of latency to outbound traffic on the ports of the services I had started. I also wrote scripts to remove the latency rules that had been added. This was important because we wanted to be able to reverse any latency rule without restarting the clients or the servers.
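The removal side can be much simpler, because deleting the root qdisc tears down the whole prio/netem/filter tree at once and the kernel reverts the interface to its default qdisc, with no restarts needed anywhere. Again an illustrative sketch, using the same hypothetical `DRY_RUN` convention as above:

```shell
#!/usr/bin/env bash
# clear_latency: undo everything the add-latency script set up on an interface.
# Requires root to actually run; set DRY_RUN=1 to only print the command.

clear_latency() {
  local dev="$1"
  # Deleting the root qdisc removes the prio bands, the netem delay, and the
  # port filter in one step; the interface falls back to its default qdisc.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "tc qdisc del dev $dev root"
  else
    tc qdisc del dev "$dev" root
  fi
}

if [ "$#" -eq 1 ]; then clear_latency "$1"; fi
```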
Those few shell scripts were the first full realization of the black triangle for Sloth. I didn't have a fully working system yet, but I had validated the hardest part of the architecture first. From that point it took about two weeks to get to an MVP, and that time went into details like:
- Writing a daemon in Go
- Storing rule configuration in Consul
- Calling out to the various tc commands from the daemon to add or remove latency rules
- Error handling and recovery
- Adding a REST API
The next time you have to build a new complex system, think about the black triangle. What is the equivalent of the black triangle for your problem?