Antifragility And Software Development

One of the books I finally managed to read on my half-year trip was Antifragile: Things That Gain From Disorder by Nassim Nicholas Taleb, a little late to the game perhaps. It introduces the idea of antifragility, and Taleb starts the book with the following quote:

“Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”

While the book was, in my opinion, flawed in terms of delivery, it presented some remarkable insights and important observations on randomness, systems and modelling. As an engineer I was compelled to take the ideas in the book and ask myself if I can apply them to what I do day-to-day, which is what follows below.

First I want to expand slightly on why I enjoy the premise of the book. I am a firm believer in the wrong but useful definition of models, the random and intractable nature of reality, and forever spooked by humanity’s propensity to over-simplify said reality via models and symbols. For me, building models that take on board the inevitability of random events and seek to gain from them feels closer to reality and therefore purer than models that attempt to dismiss random and unfortunate events as negligible due to small probabilities. The same applies to the realm (and fallacy) of planning - we can’t possibly predict all possible future events nor encompass all variables that might have a causal link to the outcome of a project at the beginning, so we should accept that we will encounter the unknown and factor that into our model, as uncomfortable as that makes us. In fact, Taleb goes as far as to say that ‘depriving systems of vital stressors is not necessarily a good thing and can be downright harmful’.

Perhaps naively and due to my software engineering background, I see two varieties of antifragility that differ due to the scale of observation:

  • The first (intrinsic) concerns characteristics of a system itself that mean that random events as input or general volatility cause a positive outcome, however that is defined. An example of this would be a currency trading algorithm that detects increases in volatility in the market and makes decisions based on that leading to a positive outcome. Another example would be evolution at an overall scale - if you define the positive outcome of evolution as ‘making sure species still exist on Earth’, then the characteristic of high variety of species existing at the same time meant that random events such as sudden cooling of the Earth or meteorite impacts didn’t wipe out all species.
  • The second (comparative) concerns characteristics of a system that in itself can only be described as robust, but can be described as ‘antifragile’ in comparison to a competing system with mutually agreed positive outcomes. For example, autoscaling in a sports news web service is a robust feature (redundancy), but if your competitor didn’t build that into their service and falls offline during traffic spikes, you stand to gain their lost traffic, a positive outcome. In the evolution example, it would be the characteristics that can be retroactively described as ‘fittest’ when one species survives and the other doesn’t.

This is grossly oversimplified and contrived - antifragile, fragile, robust, positive outcomes, negative outcomes - like most things these are not mutually exclusive, meaning that at any given time a system will be observed by someone to be one or all of these. Having said that, stay with me - the point of this investigation wasn’t to learn to model with yet another set of characteristics to define a system, but to consider factors that our conventional thinking would leave us blind to.

I want to generalise about software projects, and to do this my stance is that the first type, intrinsic, is not something you can account for in a software development system. It is more a characteristic of the product or thought process you are modelling when you build said system, and as such is hard to generalise about. I am most likely wrong, but for the purpose of this investigation that is the stance I have taken.

With that said, I think there are a few loose categories to use when considering how you make a software development system generally robust in order to benefit from random events comparatively - not guaranteed to benefit or survive, just in a position to should the right random event occur. Examples of random events could be a team member leaving or being off work for extended periods of time, huge traffic spikes, or changes in the market you are occupying.

Code & Product

  • Make high code modifiability a priority - keep interfaces between different parts of the business clean and simple, be it microservices or parts of a monolith.
  • Similarly, assume any code you add is likely to be removed one day, meaning you should focus on making it easy to remove. Keep your ability to choose another option for part of the product high - keep your optionality high.
  • Be cognisant of nonlinearity in your models. If you are making assumptions, look at those assumptions and move them slightly to see if that induces a convex or concave response that could cause havoc for your outcome.
  • When making assumptions about the worst-case scenario, always exaggerate to cover situations that are possible but beyond what you yourself have encountered. From there, find the balance between the cost of prematurely optimising and the cost of not implementing the solution to handle the worst case now.
  • A small point but choose a popular, supported language or framework that is likely to still be prevalent in the years to come, and doesn’t show signs of losing support or experienced developers. The great thing about the web at least is that this isn’t really an issue for languages. The same applies for any third party services you use, such as analytics, logging, or hosting.
  • Also consider what features of your language or framework might help to keep optionality high and move quickly while maintaining stability - features like static typing, linting and test coverage.
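
To make the nonlinearity point above a little more concrete, here is a minimal Python sketch of nudging an assumption in both directions and comparing the average result with the baseline. The response_time_ms model and its numbers are made up purely for illustration, and the function names are my own rather than anything from the book.

```python
from typing import Callable


def second_order_effect(model: Callable[[float], float], x: float, delta: float) -> float:
    """Average of the outcomes at x - delta and x + delta, minus the outcome at x.

    Positive: the response is convex around x.
    Negative: the response is concave around x.
    Near zero: the response is roughly linear, so small errors in x roughly cancel out.
    """
    return (model(x - delta) + model(x + delta)) / 2 - model(x)


# Hypothetical model: response time in ms as a function of load,
# where load is the fraction of capacity in use (assumed to be below 1).
def response_time_ms(load: float) -> float:
    return 50 / (1 - load)


# Assumption under test: "we run at about 80% load". Nudge it by 10 points each way.
effect = second_order_effect(response_time_ms, x=0.8, delta=0.1)

# Response time is a cost, so a positive (convex) result here means that
# being wrong about the load hurts more than it helps on average.
print(f"second-order effect: {effect:.1f} ms")
```

Whether convexity is good or bad depends on what the model returns: for a cost such as response time it is the dangerous direction, for a payoff it is the one you want.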

Infrastructure

  • Argue for autoscaling for your production systems - you will have so many more problems on your plate once you hit significant traffic that justifying easy provisioning of scale to handle the traffic should be obvious.
  • Similarly, if you are aiming to scale quickly, think early about geolocation - moving servers closer to your customers, even if that means choosing AWS over Heroku and having to add more tooling at the beginning. In my experience the latency your users see is probably the biggest factor in your bottom line in markets located in different territories from your servers.
  • Similarly to the worst-case scenario above, argue for redundancy in all services in terms of required resources. The cost now is likely to be far outweighed by the cost of missed opportunity should you not factor in redundancy.
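
As a rough illustration of exaggerating the worst case and then adding redundancy on top, here is a small Python sketch. The function, its parameter names and the traffic figures are all hypothetical, and a real capacity plan would need measured numbers rather than these guesses.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     exaggeration: float = 2.0, spare: int = 1) -> int:
    """Instances to provision for an exaggerated worst case, plus spare instances.

    peak_rps:         highest requests per second seen so far (assumed figure)
    per_instance_rps: requests per second one instance can handle (assumed figure)
    exaggeration:     multiplier applied to the observed peak
    spare:            extra instances kept purely for redundancy
    """
    if peak_rps < 0 or per_instance_rps <= 0 or exaggeration < 1 or spare < 0:
        raise ValueError("inputs must be non-negative, with per_instance_rps > 0 "
                         "and exaggeration >= 1")
    worst_case_rps = peak_rps * exaggeration
    return math.ceil(worst_case_rps / per_instance_rps) + spare


# Hypothetical numbers: 1200 rps observed peak, 500 rps per instance.
needed = instances_needed(peak_rps=1200, per_instance_rps=500)
print(needed)
```

The gap between this number and what the observed peak alone would require is the cost you are arguing for now, to be weighed against the traffic you would lose by falling over.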

Team

  • Avoid silos of knowledge and bus factor problems - regularly pair and rotate who works on what if possible.
  • If possible, hire people with a variety of software and commercial experience - make the most of that breadth of experience in order to test your potentially nonlinear assumptions.
  • In terms of engineering, avoid hiring quicker than you need to. Unless you are in the lucky position of having clear teams working on clearly separated projects, hiring quickly could cause a huge overhead in terms of keeping productivity high.
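
One low-tech way to spot silos of knowledge is to write down who knows which part of the system and look for areas with a single name against them. Below is a minimal Python sketch of that idea. The team members and areas are invented, and how you gather the data (commit history, a spreadsheet, asking people) is left open.

```python
from collections import defaultdict


def sole_owner_areas(knowledge: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each person to the areas that only they know (a bus factor of one)."""
    at_risk: dict[str, list[str]] = defaultdict(list)
    for area, people in sorted(knowledge.items()):
        if len(people) == 1:
            (only_person,) = people
            at_risk[only_person].append(area)
    return dict(at_risk)


# Hypothetical team and codebase areas.
knowledge = {
    "billing": {"asha"},
    "search": {"asha", "ben"},
    "deploy-scripts": {"chen"},
    "auth": {"ben", "chen", "asha"},
    "invoicing": {"asha"},
}

silos = sole_owner_areas(knowledge)
for person, areas in silos.items():
    print(f"{person} is the only person who knows: {', '.join(areas)}")
```

Anything that shows up here is a candidate for the next round of pairing or rotation.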

Market

  • More of a rehash of an earlier point, but keep your optionality as high as possible. The market will change and your assumptions will be wrong, so you need to be able to recognise that and move quickly. Don’t get comfortable.
  • Build the product in such a way that those people whose role it is to recognise opportunities or threats can see the data they need to make decisions - don’t be surprised.

While this all seems like common sense for most experienced people in software, I think it’s interesting to look at these tenets in light of the idea that Taleb is emphasising in his book. These are all great ideas in terms of staying productive, or not letting your team become firefighting, twitch-powered developers, but they also provide the basis for making the most of random opportunities. Essentially - keep optionality high and calculated redundancy high.

If you enjoyed this post and made it to the end, good job, please leave any comments below.
