The tragic combination of inevitable bugs and immutable code
Last week witnessed a catastrophic event in the Ethereum ecosystem, when The DAO, a smart contract less than two months old, began rapidly leaking funds to an unknown party. Looking at the current set of Ethereum contracts, filled with casinos and self-declared Ponzi schemes, this might not seem like a big deal. That is, until you learn that over 12 million units of ether, the Ethereum cryptocurrency, had been invested in The DAO by almost 20,000 people. That’s around 15% of all the ether in existence, valued at over $250 million on June 17th.
Two days later, The DAO’s assets dipped below $100 million. Two things contributed to this precipitous fall. First, a third of its funds (as denominated in ether) had already been taken. And second, the resulting panic sent the market price of ether crashing down from its peak of over $21 to a more sobering $10.67. (At the time of publication, the price had recovered to around $14.) This second effect was a natural consequence of the first, since much of ether’s recent increase in value was driven by people buying it to invest in The DAO.
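As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of Python. The inputs are the rounded lower bounds quoted above (12 million ether, a $21 peak, a third taken, a $10.67 trough), so the results are approximations rather than exact valuations.

```python
# Approximate figures quoted in the text; real values were slightly higher.
ether_invested = 12_000_000   # "over 12 million units of ether"
peak_price_usd = 21.00        # "peak of over $21"
crash_price_usd = 10.67       # price after the panic
fraction_taken = 1 / 3        # "a third of its funds"

# Value of The DAO's holdings at the peak price.
value_before = ether_invested * peak_price_usd

# Value of what remained, at the post-panic price.
ether_remaining = ether_invested * (1 - fraction_taken)
value_after = ether_remaining * crash_price_usd

print(f"Before: about ${value_before / 1e6:.0f} million")
print(f"After:  about ${value_after / 1e6:.0f} million")
```

Both effects matter: losing a third of the ether alone would have left roughly two thirds of the peak value, and it is the halved price that pushes the total under $100 million.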
The DAO had promised to act as a new type of decentralized crowdsourcing vehicle, like Kickstarter or Indiegogo but without the middleman and regulation. It was designed to let participants pool their cryptocurrency, collectively vote on projects looking for funding, then invest and reap the future rewards. Before catastrophe struck, over 100 projects had already been proposed, most of which were related to Ethereum itself. In addition, The DAO allowed participants to withdraw their uninvested funds at any time, positioning itself as a low risk investment.
Ironically, the individual or group which drained The DAO did so by exploiting subtle errors in this withdrawal mechanism. Like all smart contracts in Ethereum, The DAO is just a piece of computer code, which is “immutably” (i.e. permanently and irreversibly) embedded in the blockchain and executed by every node in response to incoming transactions. And like any self-respecting smart contract, The DAO provides full transparency by making its source code easily accessible online. This means that anybody can independently verify its functionality but also, crucially, look for vulnerabilities. And yet, the immutable nature of blockchains prevents any such problems from being fixed.
At the end of May, several critical issues were highlighted on the outstanding Hacking Distributed blog, alongside a call for a moratorium on project proposals for The DAO. This is what we might call the ‘white hat’ approach, in which exploits are reported for the good of the community. Nonetheless, nobody seemed too worried, as the problems related to skewed economic incentives rather than a risk of outright theft. Simultaneously, however, it appears that others were poring over The DAO’s code with greater self-interest – namely, to look for a way to make a ton of money. And on June 17th, someone succeeded.
Draining The DAO
In a general sense, the attack arose from the interaction between vulnerabilities in The DAO’s code and other code which was designed to exploit them. You see, when looked at in isolation, The DAO did not contain any obvious mistakes, and indeed it was only released after an extensive security audit. But with the benefit of hindsight and many more eyes, a significant number of errors have since been found.
I won’t provide a full technical description of the exploit’s mechanism here, since others have already published superb and detailed post mortems (see here, here and here). But I will explain one particular vulnerability that was present, because it has been discovered in many other smart contracts and serves as an instructive example.
Let’s say that a smart contract holds funds on behalf of a number of users, and allows those users to withdraw their funds on request. The logic for the process might look something like this:
1. Wait for a user to request a withdrawal.
2. Check if that user’s balance is sufficient.
3. If so, send the requested quantity to the user’s address.
4. Check that the payment was successful.
5. If so, deduct the quantity from the user’s balance.
This all looks eminently sensible, and rather like an ATM which gives you some cash and deducts the appropriate amount from your bank balance.
So how can this simple process go wrong? Well, it turns out that if an Ethereum address belongs to a contract rather than a regular user, then this contract can run some code in response to receiving funds. And this code can, in turn, trigger other pieces of code on the Ethereum blockchain. Crucially, it can even trigger the same piece of code that caused it to be paid in the first place.
This means that, during step 3 above, the receiving address can send a new request for withdrawal, beginning a new process at step 1 before the previous process has completed. Since the user’s balance is only reduced in step 5, a new withdrawal will be approved based on the previous balance, and the same amount will be paid out again. In response to this second payment, the receiving contract can request a third, and then a fourth, and so on until the funds are drained or some other limit is reached. At this point, the user’s balance will finally be reduced by the appropriate amount, entering the negative territory which step 2 was supposed to prevent.
The equivalent would be an ATM which delivers banknotes that trigger a free repeat withdrawal when waved at the screen. The first customer to find out could empty the ATM entirely.
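To make the sequence of events concrete, here is a minimal Python simulation of the withdrawal logic described above. It is a toy model, not The DAO's actual code: the names `VulnerableBank`, `Attacker` and `receive` are invented for illustration, and an ordinary method call stands in for the way an Ethereum contract runs code when it receives funds.

```python
class VulnerableBank:
    """Toy contract that pays out before updating its books."""

    def __init__(self):
        self.balances = {}  # what each user is recorded as owning
        self.funds = 0      # what the contract actually holds

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.funds += amount

    def withdraw(self, user, amount):
        # Check that the user's recorded balance is sufficient.
        if self.balances.get(user, 0) < amount:
            return False
        # The payment itself fails if the contract has run dry.
        if self.funds < amount:
            return False
        # Send the funds; this hands control to the recipient's code.
        self.funds -= amount
        user.receive(self, amount)
        # Only now is the recorded balance reduced.
        self.balances[user] -= amount
        return True


class Attacker:
    """Recipient whose payment handler asks for another withdrawal."""

    def __init__(self, max_depth):
        self.wallet = 0
        self.depth = 0
        self.max_depth = max_depth  # stands in for a recursion limit

    def receive(self, bank, amount):
        self.wallet += amount
        if self.depth < self.max_depth:
            self.depth += 1
            bank.withdraw(self, amount)  # re-enter before the balance is updated


bank = VulnerableBank()
bank.deposit("honest user", 90)
attacker = Attacker(max_depth=4)
bank.deposit(attacker, 10)

bank.withdraw(attacker, 10)
print(attacker.wallet)          # 50: five payouts from a deposit of 10
print(bank.funds)               # 50
print(bank.balances[attacker])  # -40
```

Each nested call passes the balance check because none of the deductions has happened yet. They all run at the end, as the calls unwind, leaving the attacker's recorded balance negative.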
This ability for a piece of code to wind up calling itself is called recursion, and is a very useful technique in general computer programming. However, in the case of The DAO, it paved the way for this ruinous exploit. Nonetheless, if this had been the only problem, the attack’s potential would have been contained, because Ethereum applies a limit on how deeply recursion can occur. Unfortunately, several further bugs in The DAO amplified the effects, leading to the eventual loss of tens of millions of dollars.
Of course, if just a few lines of The DAO’s code had been written differently, none of this could have happened. For example, in the 5-step process above, if the user’s balance is reduced before the funds are sent, then recursive calling would be perfectly safe. But sadly, even if its creators’ intentions were pure, The DAO’s actual code was deeply flawed. And computers have a nasty habit of blindly following the instructions they are given, even if a five year old can see that the results don’t make sense. Having been embedded immutably in the Ethereum blockchain, the faulty DAO was granted stewardship over hundreds of millions of dollars by a horde of naïve investors, and then spectacularly went up in flames. The DAO turned out to be a complete and utter shambles, and it can never be fixed.
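The reordering mentioned above, in which the balance is reduced before the funds are sent, can be sketched as follows. This is an illustrative Python model with invented names (`SaferBank`, `ReentrantUser`, `receive`), not a rewrite of The DAO's code, and it assumes the payment itself cannot fail once the checks have passed.

```python
class SaferBank:
    """Toy contract that updates its books before paying out."""

    def __init__(self):
        self.balances = {}
        self.funds = 0

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.funds += amount

    def withdraw(self, user, amount):
        if self.balances.get(user, 0) < amount or self.funds < amount:
            return False
        # Deduct first, so any re-entrant request sees the reduced balance.
        self.balances[user] -= amount
        self.funds -= amount
        user.receive(self, amount)  # the recipient's code runs here
        return True


class ReentrantUser:
    """Recipient that tries to withdraw again whenever it is paid."""

    def __init__(self):
        self.wallet = 0
        self.rejected = 0

    def receive(self, bank, amount):
        self.wallet += amount
        if not bank.withdraw(self, amount):
            self.rejected += 1  # the repeat request was refused


bank = SaferBank()
bank.deposit("honest user", 90)
attacker = ReentrantUser()
bank.deposit(attacker, 10)

bank.withdraw(attacker, 10)
print(attacker.wallet)          # 10: only the original deposit comes back
print(bank.funds)               # 90
print(bank.balances[attacker])  # 0
```

The recursive call still happens, but it now fails the balance check, so it does no harm.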
The trouble with code
Tempting as it might be, I’m not here to haul The DAO’s programmers over the technical coals. Looking at the underlying source code, it seems reasonably well architected, with good function and variable names and clear internal documentation. While none of this proves its quality, there tends to be a high correlation between how code looks and how well it functions, for the same reason that CVs with poor punctuation warn of sloppy employees. In any event I don’t doubt that The DAO’s authors are competent developers – indeed, the fact that it passed an extensive code review suggests that the basic logic was sound.
So if the problem is not the people who worked on this project, or the work they produced, what is it? It is the fact that writing large pieces of bug-free code is extremely hard, if not impossible. I’ve worked with some truly outstanding programmers in my career, the sort who can crank out code at ten times the average developer’s pace, and with a tenth of the defects. And yet, even these remarkable individuals make mistakes which lead to software malfunctions. Donald Knuth, possibly the greatest computer programmer of all time, made a famous promise to provide an exponentially increasing financial reward to each person who found a bug in his TeX typesetting software. And he’s sent out more than a few checks.
To be clear, I’m not talking about silly slip-ups with names like “off-by-one”, “uninitialized variable” and “operator precedence”. These often cause a visible failure the first time a program is run, and can be easily spotted by reviewing the local piece of code in which they reside. And I’m not even talking about security vulnerabilities like “unvalidated inputs”, “SQL injection” and “buffer overflows”, which might not show up in a program’s regular usage, but should nonetheless be front of mind for every experienced developer.
Rather, I’m talking about trickier problems like “race conditions” and “deadlocks”. These arise from conflicts between parallel processes and tend to only show up intermittently, making them hard to detect and reproduce. As a result, they can only be understood by considering a system as a whole and how its constituent parts interact. This is much harder than regular programming, because it requires developers to think beyond the individual piece of code that they’re working on. It’s not unusual for coders to spend several days “debugging” in order to nail one of these problems down. And this is precisely the sort of holistic thinking that was needed to foresee how The DAO might be vulnerable.
With all of these difficulties, one might legitimately wonder why our increasingly code-driven world isn’t crumbling around us. Luckily, most software has three critical factors working in its favor – gradual adoption, regular updates and time.
Here’s how it works: A new software product is created to answer an emerging market need. At first, the market is small, so only a few people know they need the product. And since the product is new, an even smaller number of them will actually find it. These “early adopters” are a brave and hardy bunch who enjoy living on the technological edge, despite the associated risks. So they try out the new product, see some stuff they like, ask for a bunch of things that are missing and, best of all, report any problems encountered. Every good software entrepreneur knows to shower these people with love and assistance, and thank them for every single morsel of feedback they provide. Because while it sucks to hear about a defect in your product, it sucks a lot more not to hear about it.
Ideally, within a month or less, a new version of the product is released, fixing the reported bugs and adding some requested features. The early adopters are happy and more feedback flows in, as the latest version is put through its paces, and round it goes again. As the market grows, the number of people using the product increases. And as the product steadily improves, more and more of these people tell others about it. Even better, the more people that use the product, the more likely it is that someone, somewhere, will create that precise and unlikely situation in which an obscure bug will appear. With a bit of luck, they will let you know, and you will scratch your head in disbelief, ask for more information, eventually find and resolve the problem, and breathe a sigh of relief.
With few exceptions, this is how today’s software development works, because it is the most efficient way to create outstanding products. Of course, a good software team will also develop an extensive internal test suite, to catch as many errors as possible before they reach users, and ensure that new versions don’t break anything that previously worked. But still, most of us also rely on our user bases, because there is simply no way that we can afford to imagine and test every possible way in which our products might be used. And if you think this doesn’t apply to the big guys, you couldn’t be more wrong. How many “automatic updates” have been downloaded to your Windows, Mac or Linux system in the past year? And if you’re using Chrome or Firefox, your web browser now updates itself automatically and silently, an average of once per month.
This iterative process takes considerable time, by which I mean a few years or more. Still, after a product has been in development for long enough, and its user base has grown large enough, and those users have been (unknowingly) testing it in enough different situations, something magic happens. This magic is called “maturity”, and it’s what every software product must strive to achieve. Maturity means that a product works really well for pretty much everybody that uses it, and there are no shortcuts to getting there. But if you get the timing right, your product will mature at around the time that your target market coalesces, i.e. when large numbers of customers are actually willing to stump up and pay for it. And then, as they say, verily shall ye profit.
On immutable code
So here we come to the fundamental problem with smart contracts, as demonstrated so forcefully by The DAO:
By design, smart contracts are immutably embedded in a blockchain, and so cannot be updated. This prevents them from reaching maturity.
In previous posts, I’ve discussed other problems with smart contracts, such as their effect on blockchain performance and the fact that they are less powerful than many people imagine. For these and other reasons, we have not (yet) implemented smart contracts in the MultiChain blockchain platform. But until I witnessed the failure of The DAO, I hadn’t given enough thought to a much more fundamental issue: any non-trivial smart contract is likely to contain defects that cannot be fixed.
For the modern software developer, unfixable code is an out-and-out nightmare, setting the bar higher than most are able to reach. But we do encounter this kind of code in some situations, such as the design of the microprocessors which lie at the heart of every computer and smartphone. This code, written in languages like Verilog and VHDL, defines the physical layout of a silicon chip, which cannot be changed once manufactured. In situations like these, we tend to see several characteristics: (a) the code is written in a language that was designed with safety in mind, (b) large numbers of people work on it for several years, (c) it is subject to extensive automated testing and formal verification, and (d) if the final product is shipped with a defect, the cost of a recall falls squarely on the shoulders of the party responsible (see for example the infamous Pentium bug).
It goes without saying that none of this applies to the creators of The DAO, or indeed any other smart contract. But code immutability isn’t the only challenge for smart contract developers. A number of other factors conspire to make Ethereum considerably more dangerous than most computing environments:
- As discussed earlier, most contracts reveal their source code, to gain the trust of potential users. This makes bugs easy to find and exploit. While regular code can be fixed when a problem is found, with immutable code only attackers get to benefit.
- As in most programming languages, one “function” (piece of code) on the blockchain is able to “call” (trigger) another, to create cascading effects. However, Ethereum is unusual in enabling direct function calls between the code written by parties who do not know each other and whose interests may collide. This is a perfect recipe for adversarial and unexpected behavior.
- As mentioned previously, if one Ethereum contract sends funds to another, the latter has the opportunity to execute some code in response. This code can be deliberately designed to cause the send operation to fail, potentially triggering all sorts of further havoc.
- When one function calls another, and this second function calls a third, a “stack” of calls and sub-calls is created. Keeping track of this stack carries a computational cost, so Ethereum includes a “call stack limit” which restricts how deep it can go. This is fair enough. But if the limit is reached by a particular function call, the Ethereum environment silently skips that call, rather than safely terminating the entire transaction and unwinding its effects. In other words, some code in a smart contract just might not be executed, and this non-execution can be deliberately caused by triggering that contract from a sufficiently deep stack. This strikes me as a truly abominable design choice, breaking the mental model that every software developer is accustomed to. Whoever made this decision probably should be hauled over the coals, though there is thankfully now a suggestion to change it.
- Ethereum also has a “gas limit”, which prevents abuse in public blockchains by making transactions pay for the computational resources they consume. The sender of a transaction decides how much gas they are willing to spend, and if this runs out before the transaction completes, it is safely aborted. While this is probably the best solution to a difficult problem, it can have unpleasant consequences. Some contracts turn out to need more gas than anticipated, while others cannot be run at all.
- The public Ethereum network’s cryptocurrency allows defects in smart contracts to send real money to the wrong place, with no easy method of recovery. While Ethereum miners seem to be voting in favor of a “soft fork” to freeze the funds drained from The DAO, this is not a sustainable solution.
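The call stack problem in the list above is easier to see in a small model. The Python sketch below is a deliberately simplified illustration, not the EVM's real behavior in detail: `MAX_DEPTH`, `send`, `careless_payout` and `careful_payout` are hypothetical names, and the depth limit is set to a tiny value for readability.

```python
MAX_DEPTH = 4  # hypothetical, tiny limit for illustration; the real one is much larger


def send(wallet, amount, depth):
    """Toy payment: quietly does nothing once the call stack is too deep."""
    if depth > MAX_DEPTH:
        return False  # the call is skipped, but the caller carries on
    wallet["balance"] += amount
    return True


def careless_payout(ledger, wallet, user, depth):
    """Records the user as paid without checking whether send() worked."""
    amount = ledger[user]
    send(wallet, amount, depth + 1)  # result ignored
    ledger[user] = 0


def careful_payout(ledger, wallet, user, depth):
    """Aborts if send() fails, so the ledger is left untouched."""
    amount = ledger[user]
    if not send(wallet, amount, depth + 1):
        raise RuntimeError("send failed; aborting payout")
    ledger[user] = 0


# An adversary arranges for the payout to run at the deepest allowed level.
ledger_a, wallet_a = {"alice": 100}, {"balance": 0}
careless_payout(ledger_a, wallet_a, "alice", depth=MAX_DEPTH)
print(ledger_a, wallet_a)  # {'alice': 0} {'balance': 0}

ledger_b, wallet_b = {"alice": 100}, {"balance": 0}
try:
    careful_payout(ledger_b, wallet_b, "alice", depth=MAX_DEPTH)
except RuntimeError as err:
    print("aborted:", err)
print(ledger_b, wallet_b)  # {'alice': 100} {'balance': 0}
```

In the careless version, the ledger says Alice has been paid although no money moved. The careful version behaves as most developers would expect a failed call to behave by default, by stopping and leaving the records as they were.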
To summarize, compared to regular centralized computer systems, Ethereum is a much more tricky environment to code for safely. And yet its principle of immutability serves to prevent buggy software from being updated. In other words, smart contracts are software whose bugs are visible, cannot be fixed, and directly control real people’s money. This, rather obviously, is a highly toxic mix.
Proponents of Ethereum-style smart contracts in private blockchains might be tempted to celebrate The DAO’s demise, but I don’t think this response is merited. With the exception of the last two points above, all of the issues with Ethereum apply equally to permissioned blockchains, which still rely on immutable smart contracts – although in this case the immutability is guaranteed by a group of identified parties rather than anonymous miners. If you want to claim that private blockchains allow buggy smart contracts to be more easily rewound, replaced or ignored, then what you’re really saying is that smart contracts serve no purpose in these blockchains at all. Put simply, if something is not meant to be immutable, it shouldn’t be stored in a blockchain. Instead, stick to good old fashioned legal documents and centralized application logic, using the chain for: (a) immutably storing the data on which that logic depends, and (b) representing the final consensual outcome of applying it. (This design pattern has been named Simple Contracts by others.)
Nonetheless, the risks in the public Ethereum network are undoubtedly worse, because badly written smart contracts can rapidly and irreversibly send large amounts of real value (in the form of cryptocurrency) to users whose identity is unknown. Indeed, is there any better way for an evil genius to make a killing than: (a) writing a smart contract which looks right and fair, (b) allowing it to run safely and consistently for several years, (c) waiting for it to accumulate a large sum of money from investors, and then (d) triggering some obscure vulnerability to siphon off those funds? While I’m not suggesting that The DAO’s failure was deliberate, it will surely inspire others to make similar “mistakes”.
If I had to summarize the factors underlying Ethereum’s design, I might use the phrase “inexperienced genius”. Genius, because I believe it is a genuinely brilliant invention, adding two key innovations to the cryptocurrency systems that came before: (a) the Ethereum Virtual Machine which executes smart contracts and its method for assigning cost to computation, and (b) the use of Patricia trees to enable compact proofs of any aspect of a blockchain’s state. And yet, inexperienced as well, because some of Ethereum’s design choices are so obviously terrible, such as the silent-but-violent call stack limit, or the ability of a payment recipient to recursively trigger the code which paid it.
None of this would be a problem if Ethereum were being treated as an experiment, worthy of exploration but with critical issues remaining to be resolved: the equivalent, perhaps, of bitcoin during its first couple of years, when its total market capitalization didn’t go beyond a few million dollars. Unfortunately, as a result of speculation and inflated expectations, Ethereum hasn’t been given the same opportunity to find its proverbial feet. Instead, at less than one year old, it’s carrying a billion dollars in market value. Ethereum is like a toddler being forced to cook dinner, or an economics freshman chairing the Federal Reserve. I believe it’s time to recognize that the immaturity problem of individual smart contracts also applies to Ethereum as a whole.
Ethereum’s way forward
While I’m yet to see strong use cases for smart contracts in private or permissioned blockchains, I think they probably do have a place in public chains with associated cryptocurrencies. That is, if you accept the basic premise of censorship-free financial systems, which help the financially excluded and ransomware authors in equal measure. Putting this debate aside, there is certainly technical merit in a cryptocurrency which supports arbitrary logic, of the sort that cannot be implemented on “first generation” blockchains like bitcoin. For now at least, Ethereum is the first and only convincing attempt to build such a system, with a ton of money and momentum behind it.
Nonetheless, as a developer ecosystem, Ethereum appears to be fundamentally broken. While The DAO is its most costly and high profile failure, many other contracts are suffering from similar problems. So how can Ethereum clean up its act?
- Send a clear message that, at least for the next two years, nobody should send any funds to a smart contract unless they are happy to lose them in the name of self-education.
- Fix some glaring issues with the Ethereum Virtual Machine (“EVM”), namely: (a) removing the call stack limit, (b) providing a way to send ether without triggering code, and (c) allowing contracts to be marked as “non-reentrant”, meaning that their functions cannot be called while they are already in the middle of something.
- Develop a new programming language for smart contracts, which uses a more restrictive method for expressing computation that is amenable to formal proofs of correctness. Decades of research have already been invested in this field, so there is much existing work to be leveraged. (This won’t require changes to the EVM itself, since the chosen language could still be compiled into regular “bytecode”.)
- Build up an official set of secure smart contracts and functions, which have been peer-reviewed to death and proven themselves reliable in many different situations. This is akin to the standard libraries that are available for many mature programming languages. (Though at this point it’s tempting to ask: why not just hard-code the functionality of these libraries into the EVM, and enjoy much better performance as a result? Answer: Because Ethereum was specifically designed to move away from blockchains with hard-coded feature sets. But still, it does make you wonder.)
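To illustrate what a “non-reentrant” marking from the list above might mean in practice, here is a minimal Python sketch. It is an analogy only: the decorator `non_reentrant`, the `ReentrancyError` exception and the `GuardedBank` class are all invented for this example, and nothing here describes an existing EVM feature.

```python
import functools


class ReentrancyError(Exception):
    """Raised when a guarded method is called while it is already running."""


def non_reentrant(method):
    """Hypothetical marker: refuse calls made while the object is busy."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, "_busy", False):
            raise ReentrancyError(f"{method.__name__} is already running")
        self._busy = True
        try:
            return method(self, *args, **kwargs)
        finally:
            self._busy = False
    return wrapper


class GuardedBank:
    def __init__(self):
        self.balances = {}
        self.funds = 0

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.funds += amount

    @non_reentrant
    def withdraw(self, user, amount):
        if self.balances.get(user, 0) < amount or self.funds < amount:
            return False
        self.funds -= amount
        user.receive(self, amount)     # the recipient's code runs here
        self.balances[user] -= amount  # risky ordering, kept on purpose
        return True


class Attacker:
    def __init__(self):
        self.wallet = 0
        self.blocked = False

    def receive(self, bank, amount):
        self.wallet += amount
        try:
            bank.withdraw(self, amount)  # attempt to re-enter
        except ReentrancyError:
            self.blocked = True


bank = GuardedBank()
bank.deposit("honest user", 90)
attacker = Attacker()
bank.deposit(attacker, 10)

bank.withdraw(attacker, 10)
print(attacker.wallet, attacker.blocked, bank.funds)  # 10 True 90
```

The withdrawal code still updates the balance too late, yet the guard turns the repeat request into an immediate failure. A platform-level marking of this kind would protect contracts even when their authors get the ordering wrong.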
The current option, of manually intervening in response to the failure of specific smart contracts, will not be viable on a larger scale if Ethereum is to maintain its identity as a trustless and decentralized computing platform. Indeed, some make a credible case that this single judgment-based act of governance has already destroyed Ethereum’s reputation. And we should note that The DAO’s terms and conditions explicitly state that nothing “may modify or add any additional obligations or guarantees beyond those set forth in The DAO’s code”. In other words, whoever drained The DAO was acting in accordance with its published terms, and is therefore presumably on the right side of the law.
We must also accept the possibility that, after several more years of good work, Ethereum might still prove too difficult for developers to work with safely. In that case, it will languish as a matchmaking service between anonymous scammers and their foolish marks. But that wouldn’t mean it was a waste of time – at the very least, Ethereum is a fascinating experiment, from which the blockchain community will learn a lot.
In the meantime, for users of private blockchains, I can only repeat what I’ve said before:
If your application doesn’t require smart contracts, then use a simpler blockchain architecture.
Whereas this advice was previously justified in terms of performance, it is now reinforced by the apparent difficulty of getting smart contracts right. And if you’re not sure whether your use case requires smart contracts, feel free to email us with some details, and we’ll be happy to let you know.