Transcript
Peterson: I do want to talk about code. I’m a recovering security person, but I’m an engineer at heart. I’ve been obsessed with trying to make efficient software for a long time. Cost is this very interesting metric that I think a lot of us as software engineers forget about. If you’re operating in the cloud, it’s a pretty dangerous thing to forget about. We’re going to get into some examples.
I’ve been building on AWS for a long time. I’ve also worked at the other cloud providers. This talk is going to talk a little bit about AWS specifically, but I’m an equal opportunity cloud user. I enjoy using all the flavors, so don’t throw me under the bus for that. I founded CloudZero in 2016 specifically to help empower engineers to get their arms around the cost of their software. I want everyone to be engineering profit at the end of the day.
Every Engineering Decision Is a Buying Decision
I want us all to remember this statement: every engineering decision is a buying decision. More scrutiny is going to be directed at the money being spent on this event, or what you spent on dinner or lunch today, than at how much money is getting spent in the cloud. Somebody somewhere in accounting is staring at that $50 lunch; nobody is necessarily staring at the $10,000 that your engineers are spending on cloud computing. It’s hard to fathom this, because the CTOs, the CIOs, the CFOs, they used to be in charge of the procurement process, the purchasing process. Now, today, all of you, if you’re engineers, if you’ve written a line of code that is run in the cloud, you’ve made a buying decision. You might not have known about the cost of that buying decision, but I’d like to change that.
As luck would have it, the world is talking about things that are pretty relevant to what I wanted to talk about. This is exactly where a lot of people are. They’re wondering, maybe the cloud was a mistake. Maybe it’s a scam. Maybe if we get in there, we’re going to discover that we need to get out quick, so let’s try to build everything the old way, or try to operate in a way that doesn’t embrace all the features and services that you have. You end up discovering that lift and shift all seems like a lie. It’s super expensive. What is going on? It’s a scam. It’s a total self-fulfilling prophecy. We need to think differently about this, because in my mind the cloud is really a new operating platform. We’re all writing code today for yesterday’s mainframe, not realizing we need to rewrite it for the cloud if we’re going to take maximum advantage of it. This is what we’re doing. This is how the DevOps movement started: it worked fine for us, let’s just throw it over the wall to ops, no problem. Now we write code, and it works perfectly, and we throw it over for finance to worry about. That’s the new challenge for most of us today, living in this economy, but in general, just trying to build great software. At the end of the day, I want all of us to just build great software.
Some people would like to build all that software in the cloud. We’ll see how fast that happens. This guy seems to believe that it’s all going to happen at some point; the question is when. There’s $4.6 trillion, depending on who you ask, of workloads still running in the data center. The cloud, let’s not forget, despite all its growth, is still in its very early days. We still have a lot to figure out. If it’s going to move to the cloud, it’s got to make strong economic sense, but some people are convinced that this isn’t going to happen. Some people are actually convinced that it was all a mistake. I happen to think these people are horribly wrong. I live in the cloud, after all. I want to stay there. This conversation is happening because it hasn’t been abundantly clear to all of us, as we sit down to build things, that the cloud makes strong economic sense. We have to change that. I just wanted you all to know, I’ve checked the numbers, and I’m convinced it does make strong economic sense. I’ve seen it. You’re going to have to build differently. You’re going to have to write your code differently. You’re going to have to think about systems design differently. You’re not going to be able to take the thing that worked in the data center, lift and shift it into the cloud, and expect a good outcome. You’ve got to think differently about it. We’re going to get really focused on some very specific lines of code where it would have helped to have thought a little differently.
The Engineer’s Role in Software Profitability
What is your role? How many people write software for a living? How many people write software in the cloud for a living? We’re all in the right place. Cost efficiency today often reflects system quality: a well-architected system is a cost-effective system. One line of code can come down to whether or not the company you’re working for is profitable. We have a challenge that we’re going to figure out together during this talk: what is the best way to measure cost efficiency? Before we get into all of that, I just want to get into the code. I’m an engineer at heart, after all. These are all real examples, but I had to protect the witnesses. It’s all been normalized and translated into Python, all rewritten to protect the innocent. These are all things that resulted in people spending way more money than they should have, with just a few lines of code.
Example 1: Death by Debug (Even DevOps Costs Money)
I’m going to start with my favorite example. I’ve got five examples for you. Here’s the situation. An AWS Lambda function with an average monthly cost of $628. CloudWatch, average monthly cost of $31,000. What’s happening there? We just heard in the previous conversation about CloudWatch costing way more than the actual invoke of a Lambda function. I don’t know how many people have experienced this in their lives, but this happens all the time. For this particular example I’m going to show you, since deployment this single function cost $1.1 million writing this data. What was the cause of it? It was a combination of two things: code that shouldn’t have left the building, and a well-intentioned debug line that, when the ops team turned on debug logging and didn’t think much about it, echoed everything into CloudWatch. Because ops is sometimes disconnected from dev, we’d love to think that DevOps is always working together, but nobody knew about it. They just assumed that it worked that way. It ran for a long time: $1.1 million. What’s the fix for this? This one’s pretty simple. Start us out easy. We know that’s the problem. Debug logging is a great thing if I’m writing some sample code on my desktop, but the fix is just delete this. We don’t need this. When the developer is building this code and testing it out, it’s great and helpful at that time, but when you deploy it, don’t put this in there. It’s like a vulnerability. It’s a bomb waiting to go off. The fix is to delete it.
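The slide code isn’t reproduced in the transcript, but a minimal sketch of the pattern might look like this. The handler, payload, and handler body are hypothetical placeholders, not the original code:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)  # ops turned debug logging on in production

def handler(event, context):
    # The well-intentioned debug line: it echoes the entire request
    # payload into CloudWatch Logs on every invocation. At millions of
    # invokes, CloudWatch ingestion charges dwarf the Lambda invoke cost.
    logger.debug("received event: %s", json.dumps(event))
    # ... the actual work ...
    return {"statusCode": 200}
```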
Example 2: The API Costs Money
I want to give another example. Next situation: we’ve got an MVP that found its way into production. Years later, the product is now creating billions of API requests to S3, and it’s grown nice and slow, so nobody noticed along the way. The total cost of this code over just one year is $1.3 million. Does anybody want to spot some of the challenges in this code? There are quite a few. This worked as an MVP perfectly. In fact, it’s great. Get an idea, put it on paper, make it happen. Deliver it. Why are these things inside the for loop? Why on earth are we calling out to the S3 APIs while this loop is running? We could actually pull all this stuff out of there and easily cache or capture this information. The problem is this code works. When it got deployed, it worked just fine. It wasn’t until years later, when it was up to scale, that it started to cost that $1.3 million. There’s also a little detail here: maybe I shouldn’t be passing these files off to my next function for further processing. What’s the fix for this one? Let’s pull that outside of the for loop. Anybody know what reticulating a spline is? Pull this stuff outside of the for loop. Calculate the stuff or download the stuff in advance; do it once instead of the million times that we’re running inside of this function. Instead of passing just some pointers to the files to go look up later, pass the actual data. Use it once. Simple stuff. Again, we’ve all done this: we got the code working, it worked as a prototype, then it snuck out the door and we never thought about it ever again. API calls cost money. A lot of times with S3, the API calls might cost more than the storage.
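Again, the slide isn’t in the transcript, but a sketch of the anti-pattern and the fix might look like this. The bucket, key, and transform function are placeholders for illustration:

```python
import boto3

s3 = boto3.client("s3")

def transform(record, config):
    ...  # placeholder for the real per-record processing

# Anti-pattern: per-iteration S3 calls. Each get_object is billed per
# request, and the object never changes between iterations.
def process_records_slow(records):
    for record in records:  # millions of iterations at scale
        obj = s3.get_object(Bucket="my-config-bucket", Key="config.json")
        config = obj["Body"].read()
        yield transform(record, config)

# The fix: call S3 once, outside the loop, and pass the data itself to
# the next step instead of pointers for it to go look up all over again.
def process_records_fast(records):
    obj = s3.get_object(Bucket="my-config-bucket", Key="config.json")
    config = obj["Body"].read()
    for record in records:
        yield transform(record, config)
```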
Example 3: How To 2x DynamoDB Write Costs with a Few Bytes
This next example is one of my favorites. I don’t actually know how much this would cost to run, but I do know that it is certainly an expensive mistake that I’ve made before and that I’ve seen others make. When I learned about it from Khawaja, I just had to include it in the examples. This is an example of how a developer was asked to add something very simple. This record that we’re writing to DynamoDB doesn’t have a timestamp; we’d like to know when it was actually written. Why don’t you just add that field? It should be super easy. A code change that takes a second: somebody tested it, deployed it, it’s up and running. Look at the bill shortly thereafter: DynamoDB costs have just gone up 2x. This one’s a little harder to spot. Does anybody see why adding that timestamp line, that single line, made DynamoDB cost two times as much as it did before? DynamoDB charges for writes in 1 KB increments, and this item was already close to 1000 bytes. We added timestamp, that’s an attribute name, that’s 9 bytes. We added a timestamp value in ISO format, that’s 32 bytes. Those 41 bytes pushed every write over the 1 KB boundary into a second write unit: 2x the cost, one line of code. Pretty hard to spot that one. We have to think differently about how the data is flowing across the line. More importantly, how that affects our costs. What’s the fix for this one? Actually, we should do two things. We should reduce the size of the attribute name. As Khawaja mentioned to me earlier, it makes it more aerodynamic when it flies through the wire. It’s a very important property of the TCP/IP protocol. Make that a ts instead of timestamp. Let’s shave off a few bytes there. Let’s reformat our timestamp so that we’re down to 20 bytes. Good news: we’ve got 2 bytes to spare, we’re under the wire, so costs are back to where they needed to be. One line of code, 2x the cost.
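A minimal sketch of the byte math, with a hypothetical table and item (the names and sizes are illustrative, not the original code):

```python
import boto3

# Hypothetical table and item, just to illustrate the write-unit math.
table = boto3.resource("dynamodb").Table("events")
item = {"pk": "device#123", "payload": "..."}  # assume ~990 bytes already

# Before: a 9-byte attribute name plus a 32-byte ISO-8601 value pushed
# the item past the 1 KB write-unit boundary -- two write units per
# put_item instead of one, doubling write costs.
# item["timestamp"] = "2024-01-15T10:30:00.000000+00:00"

# After: a 2-byte name and a 20-byte timestamp keep the item under 1 KB.
item["ts"] = "2024-01-15T10:30:00Z"
table.put_item(Item=item)
```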
Example 4: Leaking Infra-as-Code (Terraform Edition)
Let’s not forget about infrastructure as code; Terraform and CloudFormation are guilty here too. Everything is code these days. We’ve got a Terraform template creating autoscaling groups. They scale up clusters that have hundreds and maybe even thousands of EC2 instances at a time. Someone designed the system so that it refreshes the instances every 24 hours. Maybe they had a memory leak and thought that was a good way to fix it. Then somebody in security was worried that the data might be important, so let’s not delete the EBS volumes. Would somebody ever go look at those volumes? No. This system ran for about a year, slowly generating cost. At the end of that year, $1.1 million went out the door. This is a little bit verbose, as is all infrastructure as code.
There are two lines in here that are the culprits. Somebody turned off delete_on_termination. Somebody set the max_instance_lifetime to that refresh time. They’re in two different files too, so people aren’t necessarily even looking at the same thing. Those two changes meant that every time an EC2 instance spun up, it created an EBS volume that was left behind after the instance terminated and refreshed itself. Over time, with a max size of 1,000 and somewhere around 300 to 600 EC2 instances running at any given moment in this environment, that adds up pretty quickly. Over a year, it adds up to just over a million dollars. The fix to this one’s a little more complicated. This one, you have to change your processes. You have to think a little bit about how your team is working when they put these things in place. If you’re going to create resources, you should always also ask the question: how are we going to get rid of them? This applies to a lot of things in the cloud, not just cost. A lot of us have spent the last couple of years thinking about how to scale up, but we don’t think enough about how to scale down. Scaling down is way harder, and it’s way more important. It also can save your business. If you were a travel company during COVID, the ones that really survived were the ones who knew how to scale down. I hear amazing things about the team at Expedia; they really knocked that one out of the park. Beware of well-intentioned infrastructure as code, particularly if you’ve got requirements coming in from different teams.
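The template itself isn’t reproduced in the transcript, but a hypothetical reconstruction of the two settings, in Terraform, might look something like this (resource names and values are illustrative):

```hcl
# File 1: the launch template keeps EBS volumes after termination.
resource "aws_launch_template" "workers" {
  # ...
  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      delete_on_termination = false # security asked to keep the data around
    }
  }
}

# File 2: the autoscaling group replaces every instance daily, so each
# refresh strands another batch of volumes.
resource "aws_autoscaling_group" "workers" {
  max_size              = 1000
  max_instance_lifetime = 86400 # seconds: refresh every 24 hours
  # ...
}
```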
Example 5: Cost Delivery Network
Fifth example; I left the best for last. We love content delivery networks, CDNs. They make all of our stuff go faster. We can move content to our customers better. There’s a particular company that had 2.3 million devices out there, running all across the world, and a change was deployed. It got deployed to all those devices. After about 14 hours, the impact of that payload started to become known. It reached a steady state of about $4,500 an hour. It took them about 6 days to resolve it; they went through pretty exhaustive testing. The total cost of that incident was $648,000. If it had continued to run for a year without anyone noticing, and I’ll explain why they may not have even noticed, it would have been $39 million. Somebody somewhere would have been begging for that money back, I’m sure. Hopefully somebody in finance would have seen it after the first month. As it was, after 6 days, it was $648,000. That’s a pretty painful bug. From their point of view, this was a success story, because they found it within 6 days. Normally, it would have taken weeks or months. Let’s just say that’s a really messed up barometer for what counts as a success story, unfortunately.
What’s the code? It looked something like this. They had well-intentioned update code that was probably written by an intern a long time ago, and they updated it. It used to call home once a day, download the payload, and compare a hash. That sounds like a bad idea, so why don’t we download some metadata first so that we don’t necessarily need the full download? It was actually designed to lower costs. They rolled it out. They expected everything to go in the right direction. They weren’t quite certain what happened next. Then, suddenly, they discovered it wasn’t quite working the way they expected. How many people can spot the bug in this code, despite its other issues? It’s just one single character. With that one single character, this code flipped to the more expensive path. At the same time, they moved this up to calling every hour instead of every day or two. They were off to the races. CloudFront was happy. It showed up. It did a great job. It scaled up. It delivered that content. Nobody was harmed. Everybody was happy the data was flowing. People might have been wondering why all this data was flowing around their home networks for these devices, but it worked. The operations teams and the monitoring tools behind the scenes: no errors, nothing to alert them. Datadog is not telling them anything’s wrong, because CloudFront is happily swallowing the traffic. Everything’s working. The only indicator is that they are now spending $4,500 an hour that they weren’t spending before. A single character. The fix for this one: they went back and rethought the whole thing. It wasn’t just, let’s change one character. They did a quick fix and then went back and thought more deeply about it, because this was a pretty important aspect of their product. It’s one of those things that’s very easy to do when you’re building things out and not quite operating at scale; then you bring it up to scale, and some of these things sneak in. This one was a pretty painful single-character bug that would have resulted in a $39 million bill.
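The actual code isn’t in the transcript, so here is a hedged sketch of what the shape of the bug might have been. The URLs, function names, and even the specific flipped character are guesses to illustrate the pattern:

```python
import urllib.request

# Hypothetical endpoints on the CDN.
MANIFEST_URL = "https://updates.example.com/manifest.txt"  # tiny metadata file
PAYLOAD_URL = "https://updates.example.com/payload.bin"    # the full download

def apply_update(payload: bytes, new_hash: str) -> None:
    ...  # placeholder for installing the update

def check_for_update(current_hash: str) -> None:
    # Cheap path: fetch only the expected hash from the CDN.
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        latest_hash = resp.read().decode().strip()

    # The single-character bug: "==" where "!=" was intended. Every device
    # now takes the expensive full-download path precisely when it is
    # already up to date -- which, on an hourly schedule, is every hour.
    if latest_hash == current_hash:  # should be: latest_hash != current_hash
        with urllib.request.urlopen(PAYLOAD_URL) as resp:
            apply_update(resp.read(), latest_hash)
```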
Lessons Learned
What have we learned from all this? Storage is still cheap. We should really still be thinking about storage as being pretty cheap. Calling APIs, that costs money. It’s always going to cost money. In fact, you should just accept that anything you do in the cloud costs money. It might not be a lot. It might be a few pennies, it might be a few fractions of pennies, but it costs money. You should think about that before you call an API. We all have practically infinite scale now; I have not yet found infinite wallet. We have a constraint that no one’s figured out. CDNs are always going to be great at eating traffic and eating your money. Is the important takeaway from this that we should now layer one more thing on top of everything all of us do all day long? I do spend some time thinking about this, but that sounds pretty painful. We’re just going to have to get all of our engineers agonizing over the cost of their code. They’ve got plenty of time to do all this. This is as true as it ever was.
Premature optimization is always going to be evil, whether it’s performance or cost. The first thing that we have to figure out as engineers often is: can this damn thing even run? Will it even work? Can I even solve this problem? Because all these examples I’ve shared with you, they’re not actually problems until you get to scale. They’re not actually problems unless you’re successful. They’re not actually issues that you should even care about unless you might be onto something with the product or the service that you’re building. In my opinion, cloud engineers should think about these questions, but they should do it iteratively over time, and not all at once. First answer that question: can it even be done? Then remember, you work on a team. Is this the right way to do it as a team? What happens if this thing becomes popular? Maybe that’s the point where you should start thinking about it. Then: how much money should this cost to run? I don’t know about you, but when I first started building some of the first systems in the cloud, I had no idea how much any of this stuff should cost. In fact, when I went to my CFO and said I wanted to use AWS for a project, he said, “Erik, you can do whatever you want, but you have a budget of $3,000. Don’t go spend it all at once.” That was a long time ago. We were a little fearful. I said, I want to play with the cool toys. We used that $3,000 to build a pretty amazing system that we then sold for $800,000. The only reason we did that was because I was suddenly obsessed with trying to maximize my return on that investment. Is that for everyone? Do we want to give engineers a budget? I think, actually, we want to give engineers something a little bit more powerful than a number. Because oftentimes the cost of operating in the cloud equals buying a Lamborghini every single day, which seems completely abstract and bizarre. Instead, let’s focus on efficiency.
Cloud Efficiency Rate
For that, I have a concept called the cloud efficiency rate that I want to share with you. It’s really simple; it’s designed to be simple. The cloud efficiency rate can really guide you into thinking about when it’s time to start actually optimizing, and not do it too prematurely. You calculate it by taking your revenue minus your cloud costs, over your revenue; that gives you your percentage. A simple example: you’re making $100 million as a company and your cloud costs are $20 million, so you’re spending 20 cents per dollar of revenue, and your cloud efficiency rate is 80%. That’s awesome. You don’t necessarily need to get there right out of the gate. The cloud efficiency rate is something that we should think of as a non-functional requirement. For any cloud project, you should have your product managers, your product people, the business, somebody, helping guide you to understand the desired cloud efficiency rate, and what it should be at each point in the lifecycle of the application as you grow and build it.
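Expressed as code, the calculation described above is just this, a trivial sketch using the example from the talk:

```python
def cloud_efficiency_rate(revenue: float, cloud_cost: float) -> float:
    """CER = (revenue - cloud cost) / revenue, expressed as a percentage."""
    return (revenue - cloud_cost) / revenue * 100

# The talk's example: $100M revenue, $20M cloud spend -> 80% CER.
print(cloud_efficiency_rate(100_000_000, 20_000_000))  # 80.0
```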
R&D, trying to figure out if it just works: it could be negative. It should be negative. If you’re making money before you even ship the product, something is wrong. Or maybe that’s an excellent business model. Once you get to MVP, try to break even. This is what I would recommend: try to break even, have a low CER, it doesn’t matter, zero to 25%. Get to product market fit. People are actually telling you that they want to buy your product, and it looks like this thing might actually become popular? Let’s try to get to 25% to 50%. We’re now scaling. It is in fact popular. People are running around buying my stuff. I need a demonstrable path to healthy margins: let’s get to 50% to 80%. Once I’m at steady state, if I want to truly have a healthy business, if I want to be a profit engineer for my organization and not a drag on it, getting to 80% is a really good guide. The cloud efficiency rate takes this abstract, annoying thing, dollars, that none of us really want to understand or can understand, and turns it into a target that we can set in motion for us. It can be across our entire cloud platform, or it can be specific to an individual customer, or a feature, or a service, or anything new that I’m trying to ship. As a rule of thumb, I would recommend you target a CER of about 80%.
Conclusion
When I was doing the homework for what I wanted to talk about here, some people said that Sir Tony Hoare was responsible for the premature optimization quote. This blew me away when I did my research, because this gentleman had me beat by years. I was playing in the wrong category. He had a billion-dollar mistake. I’m in the small leagues; I could up my game. He gave a talk at QCon London years ago about inventing the null reference in 1965. Because of his early sin, he says, this has probably cost billions of dollars in the world economy. He’s probably right. One thing we all know as engineers is that, over time, our code takes on a life of its own. It falls into other people’s hands. It moves on and we lose sight of it. Those things that cost a few pennies while we were writing that code and testing it out on a laptop could, once deployed and running somewhere, cost $1.1 million per year to run. When you get back to your laptop or your computer and take a peek at some of that stuff, think about that. Hopefully, you won’t gasp at what you find. If you do find something, it’ll save you from giving a talk in 10 years about billion-dollar lines of code. I wanted everyone to leave with this idea in their head: every engineering decision is a buying decision. You all are making the buying decisions. It’s a very powerful role that you have in the organization. In this economy, in this cloud-driven world, it is, in my opinion, the most important role. You all have the power, more powerful than the CFO.
Questions and Answers
Participant 1: I work for a nonprofit, so calculating the efficiency rate is a little bit tricky, because we don’t actually have revenue or products that we’re selling. What would you use to adjust the efficiency rate formula to try and calculate it when you don’t actually have a revenue stream?
Peterson: As a nonprofit, you have a number in your bank account, a budget somewhere; I would substitute that in that case. There’s another variant of this as well, if you’re at a government organization: you actually don’t want to come in under budget. You have to flip the equation around sometimes and actually hit your budget perfectly. Those are two scenarios where you have to think a little bit differently if you’re not in the business of generating revenue. I would take your budget, or your fundraising goal if you want to be projecting or forecasting out a little bit, and say, we intend to meet these goals, and this is what we’re going to allocate for our cloud budget. Then set that as how you would calculate the rate. It burns down a little bit differently.
Participant 2: There’s a huge system, and a small team of engineers branching out some portion of it to the cloud. They’re facing cost pressure in their move. What would you suggest? It’s hard to calculate the cloud efficiency rate from your example, because it’s only a part of the system, and only parts of the system are migrating to the cloud over time. Where would you say they can find guidance for that?
Peterson: In the case of migrations, it’s really the same thing. You’re moving from one computing environment to another, and you’re spending money. Hopefully you have a good understanding of what your on-prem operational costs are for that particular component. If you don’t, you’re going to have to do some back-of-the-envelope math. You should have that in place already if you’re considering a move, because it’s going to really help you judge whether you were successful at the end of the day. Let’s just assume that you’ve got some data there; that’s going to become your baseline. Then you’re going to apply the same equation. What was your cloud efficiency rate? What was your data center efficiency rate, essentially? The numbers might not be one to one. There are other things you’re probably going to have to incorporate into that on-prem number: cooling, power, headcount, people. Those are other parts that’ll contribute to the COGS. I would include those as well in the calculation, so you get something that looks a little bit more realistic.