Story Points

Story Pointless (Part 3 of 3)

The final part of this three-part series on moving away from Story Points and how to introduce empirical methods within your team(s).

Part one refamiliarised ourselves with what story points are, a brief history lesson and facts about them, the pitfalls of using them and how we can use alternative methods for single item estimation. 

Part two looked at probabilistic vs. deterministic thinking, the use of burndown/burnups, the flaw of averages and monte carlo simulation for multiple item estimation.

Part three focuses on some common questions and challenges posed with these new methods, allowing you to handle those questions you may get asked when wanting to introduce a new approach in your teams/organisation.

The one question I get asked the most

Would you say story points have no place in Agile?

My personal preference is that just like Agile has evolved to be a ‘better way’ (in most contexts) than Waterfall, the methods described in this series are a ‘better way’ than using Story Points. Story Points make sense to be used in contexts where you have little or no dependencies and spend more time ‘doing’ than ‘waiting’.

Troy Magennis — What’s the Story About Agile Data

The problem is that so few teams in a large organisation like ours have this context yet have been made to “believe” story points are the right thing to do. For contexts like this, teams are much better off estimating the time they will spend ‘blocked’ or ‘waiting’, rather than the active time ‘doing’.

Common questions posed for single item estimation

But the value is in the conversation, isn’t that what story points are about?

Gaining a shared understanding of the work is most definitely important! The problem is that there are much better ways of understanding the problem you’re trying to solve than giving something a Fibonacci number and debating if something is a ‘2’ or a ‘3’ or why someone feels that way about a number. You don’t need a ‘number’ to have a conversation — don’t confuse estimation with analysis! The most effective way to learn and understand the problem is by doing the work itself. This method provides a much more effective approach in getting to that sooner than story points do.

Does this mean all stories are the same size?

No! This is a common misconception you may hear. What we care about is “right sizing” our items, meaning they are no larger than an agreed size. This is derived by using the 85th (or the number of your choice!) percentile, as mentioned in part one.

What about task estimation?


Not using tasks encourages collaboration, swarming and getting stories (rather than tasks) finished. It’s quite ironic that proponents of Scrum encourage sub tasks, yet one of the creators of Scrum (Jeff Sutherland) holds a different view, supported by data. In addition to this, Microsoft found that using estimates in hours had errors as large as ±400% of the estimate.

Should we not use working days and exclude weekends?

Whilst there is nothing to say excluding weekends is ‘bad’ — it again comes back to the principle of talking in the language of our customer. If we have a story that we say on 30th April 2021 has a 14-day cycle time at 85% likelihood — when is it reasonable to expect it? It would be fair to say this is on or around 14th May.

Yet if we meant 14 working days this would be 21st May (due to the bank holiday) — which is a whole week extra! Actual days again makes it easier for our customers/stakeholders to understand as we’re talking in their language.
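The date arithmetic above can be checked with a short sketch. It assumes the bank holiday in question is the early May bank holiday on 3rd May 2021, and the function names are mine, purely for illustration:

```python
from datetime import date, timedelta

# Assumption: the bank holiday referred to above is 3 May 2021.
BANK_HOLIDAYS = {date(2021, 5, 3)}


def add_calendar_days(start: date, days: int) -> date:
    """Count every day, the way a customer reading a calendar would."""
    return start + timedelta(days=days)


def add_working_days(start: date, days: int, holidays=frozenset()) -> date:
    """Count only Monday to Friday, skipping any listed holidays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


start = date(2021, 4, 30)
calendar_due = add_calendar_days(start, 14)
working_due = add_working_days(start, 14, BANK_HOLIDAYS)
print(calendar_due)  # 2021-05-14
print(working_due)   # 2021-05-21
```

Same "14 days", two different dates, which is exactly the ambiguity you avoid by always talking in calendar days.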

Do you find stakeholders accepting this? What options do you have when they don’t?

Stakeholders should (IMO) never *tell* a team how to estimate, as that is owned by the team. What I do communicate is the options they have on the likelihood. I would let them know 85% still has risk and if they want less risk (i.e., 90th/95th percentile) then it means a longer time, but the decision with risk is with them.

Why choose the 85th percentile?

The 85th percentile is common practice purely as it ‘feels right’. For most customers or stakeholders they’ll likely interpret this as “highly likely”, which will be good enough for them. Feel free to choose a higher percentile if you want less risk (but recognise it will be a longer duration!).

Common questions posed for multiple item estimation

Does this mean all stories are the same size?

No! See above.

Do we need to have lots of data for this?

Before considering how much data, the most important thing is stability of your system/process. For example if your work is highly seasonal, you might want to consider this in your input data to your forecast if the future work will be less ‘hectic’.

However, let’s get back to the question. You can get started with as little as three samples (three weeks’ or, say, three sprints’ worth) of data. The sweet spot is 7–15 samples; with more than 15 you’ll likely need to discard older data, as it may negatively impact your forecasts.

With 5 samples we are confident that the median will fall inside the range of those 5 samples, so that already gives us an idea about our timing and we can make some simple projections.
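To put a number on "confident": this reasoning is often called the "Rule of Five". Assuming the samples are independent and each is equally likely to land above or below the true median (my assumption, not something the data guarantees), the only way the median falls outside the range is if all five samples land on the same side of it. A minimal sketch of that arithmetic:

```python
def chance_median_in_sample_range(sample_count: int) -> float:
    """Probability that the true median lies between the min and max sample.

    Assumes independent samples, each with a 50% chance of being above
    the median. The median is missed only if every sample is above it
    (0.5 ** n) or every sample is below it (0.5 ** n).
    """
    return 1 - 2 * 0.5 ** sample_count


print(f"{chance_median_in_sample_range(5):.2%}")  # 93.75%
```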

(Source: Actionable Agile Metrics For Predictability)

With 11 samples we are confident that we know the whole range, as there is a 90% probability that every other sample will fall in that range.

(Source: German Tank Problem)

What if I don’t have any previous data?

Tools like the Excel sheet from Troy provide the ability to estimate your range in completed stories. Once you start the work, populate with your actual samples and change the spreadsheet to use ‘Data’ as an input.

What about if it’s a new technology/our team has changed?

Good question — throw away old data! Given you only need a few samples you should not let different contexts/team setups influence your forecast.

Should I do this at the beginning of my project/release and send to our stakeholders?

You should do it then and continue to do so as and when you get new samples; do not just do one forecast! Ensure you practice #ContinuousForecasting and caveat that any forecasts are a point in time based on current data. Remember, short term forecasts (i.e., a sprint) will be more accurate than longer ones (e.g., a year long forecast done at the start of a financial year).

What about alternative types of Monte Carlo Simulation? Markov chain etc.?

This is outside the scope of this article, but please check out this brilliant and thorough piece by Prateek Singh comparing the different types of Monte Carlo Simulation.

So does the opinion of individuals not matter?

Of course it matters :) These methods are just introducing an objective approach into that conversation, and getting us away from methods that can easily be manipulated by ‘group think’. Use it to inform your conversation, don’t just treat it as the answer.

Isn’t this more an “advanced” practice anyway? We’re pretty new to this way of working…

No! There is nothing in agile literature that says you have to start with story points (or Scrum/sprints for that matter), nor that you have to have been practicing other methods before this one. The problem with starting with methods such as story pointing is they are starting everyone off in a language no one understands. These other methods are not. In a world where unlearning and relearning is often the biggest factor in any adoption of new ways of working, I’d argue it’s our duty to make things easier for our people where we can. Speaking in a language they understand is key to that.

Conclusions

Story points != Agile. 

Any true Agilista should want to stay true to the manifesto and always be curious about uncovering better ways of working. Hopefully this series presents some fair challenges to traditional approaches but, more importantly, alternatives you can put into practice right away in your context.

Let me know in the comments if you liked this series, if it challenged you, anything you disagree with and/or any ways to make it even better.

— — — — — — — — — — — — — — — — — — — — — — — — — —


Story Pointless (Part 2 of 3)

The second in a three-part series on moving away from Story Points and how to introduce empirical methods within your team(s). 

Part one refamiliarised ourselves with what story points are, a brief history lesson and facts about them, the pitfalls of using them and how we can use alternative methods for single item estimation.

Part two looks at probabilistic vs. deterministic thinking, the use of burndown/burnups, the flaw of averages and monte carlo simulation for multiple item estimation.

Forecasting

You’ll have noticed in part one I used the word forecast a number of times, particularly when it came to the use of Cycle Time. It’s useful to clarify some meaning before we proceed.

What do we mean by a forecast?

Forecast — predict or estimate (a future event or trend).

What does a forecast consist of?

A forecast is a calculation about the future that includes both a range and a probability of that range occurring.

Where do we see forecasts?

Everywhere!

Sources: FiveThirtyEight & National Hurricane Centre

Forecasting in our context

In our context, we use forecasting to answer the key questions of:

  • When will it be done?

  • What will we get?

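Which we typically do by taking the amount of work and dividing it by the rate at which we complete work (our average velocity/throughput).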

Which we then visualize as a burnup/burndown chart, such as the example below. Feel free to play around with the inputs:

https://observablehq.com/embed/@nbrown/story-pointless?cells=viewof+work%2Cviewof+rate%2Cchart
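In code, a burnup/burndown forecast of this kind boils down to a single division. This is only a sketch of the traditional calculation, with made-up numbers and a function name of my own choosing:

```python
import math


def deterministic_weeks(work_remaining: int, average_rate: float) -> int:
    """Single-number forecast: remaining work divided by the average rate."""
    return math.ceil(work_remaining / average_rate)


# Hypothetical inputs: 60 items left, 7.5 items finished per week on average.
print(deterministic_weeks(60, 7.5))  # 8
```

One amount of work, one rate, one answer, with no indication of how likely that answer is.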

All good right? Well not really…

The problems with this approach

The big issue with this approach is that the two inputs into our forecast(s) are highly uncertain. Both are influenced by:

  • Additional work/rework

  • Feedback

  • Delivery team changes (increase/decrease)

  • Production issues

Neither input can be known exactly upfront, nor can they simply be taken as a single value, due to their variability.

And don’t forget the flaw of averages!

Plans based on average, fail on average (Sam L. Savage — The Flaw of Averages)

The above approach means forecasting using average velocity/throughput which, at best, is the odds of a coin toss!

Source:

Math with bad drawings — Why Not to Trust Statistics
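If you want to see the coin toss for yourself, here is a rough sketch. It uses an invented throughput history, builds the plan from the average, and then replays that plan many times by drawing each week at random from the history. The names and numbers are assumptions for illustration only:

```python
import math
import random


def chance_average_plan_holds(throughput_history, backlog, trials=10_000, seed=1):
    """Share of simulated futures in which the 'average' plan is met."""
    rng = random.Random(seed)
    average = sum(throughput_history) / len(throughput_history)
    planned_weeks = math.ceil(backlog / average)  # the single-number forecast
    hits = 0
    for _ in range(trials):
        # Replay the planned number of weeks, drawing each week from history.
        completed = sum(rng.choice(throughput_history) for _ in range(planned_weeks))
        if completed >= backlog:
            hits += 1
    return hits / trials


history = [1, 2, 3, 4, 5]  # hypothetical weekly throughput, average of 3
print(chance_average_plan_holds(history, backlog=30))
```

With a symmetric history like this one, the plan holds in roughly half of the simulated futures, which is not the level of confidence most stakeholders assume when they are handed a single date.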

Using averages as inputs to any forecasting is fraught with danger, in particular as it is not transparent to those consuming the information. If it was it would most likely lead to a different type of conversation:

But this is Agile — we can’t know exactly when something will be done!?!…

Source: Jon Smart — Sooner, Safer, Happier

Estimating when something will be done is particularly tricky in the world of software development. Our work predominantly sits in the domain of ‘Complex’ (using Cynefin) where there are “unknown unknowns”. Therefore, when someone asks, “when will it be done?” or “what will we get?” — when we estimate, we cannot give them a single date/number, as there are many factors to consider. As a result, you need to approach the question as one which is probabilistic (a range of possibilities) rather than deterministic (a single possibility).

Forecasts are about predicting the future, but we all know the future is uncertain. Uncertainty manifests itself as a multitude of possible outcomes for a given future event, which is what science calls probability.

To think probabilistically means to acknowledge that there is more than one possible future outcome which, in our context, means using ranges, not absolutes.

Working with ranges

Communicating such a wide range to stakeholders is definitely not advisable nor is it helpful. In order to account for this, we need an approach that allows us to simulate lots of different scenarios.

The Monte Carlo method is a method of using statistical sampling to determine probabilities. Monte Carlo Simulation (MCS) is one implementation of the Monte Carlo method, where a probabilistic model is used to describe a real-world system. The model consists of uncertainties (probabilities) of inputs that get translated into uncertainties of outputs (results).

This model is run a large number (hundreds/thousands) of times resulting in many separate and independent outcomes, each representing a possible “future”. These results are then visualised into a probability distribution of possible outcomes, typically in a histogram.
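A minimal sketch of such a simulation for "when will it be done?" is below. It assumes each future week looks like a randomly chosen week from your throughput history, which is one common way to do this rather than the only one. The data and function names are hypothetical:

```python
import random
from collections import Counter


def simulate_weeks_to_finish(throughput_history, backlog, rng):
    """One possible future: keep drawing weekly throughput until the backlog is done."""
    weeks, remaining = 0, backlog
    while remaining > 0:
        remaining -= rng.choice(throughput_history)
        weeks += 1
    return weeks


def run_simulation(throughput_history, backlog, trials=1000, seed=42):
    """Run many independent futures and return the weeks each one took."""
    if max(throughput_history) <= 0:
        raise ValueError("need at least one week with completed items")
    rng = random.Random(seed)
    return [simulate_weeks_to_finish(throughput_history, backlog, rng) for _ in range(trials)]


history = [3, 5, 2, 6, 4, 4, 7]  # hypothetical items completed per week
outcomes = run_simulation(history, backlog=40)

# Text histogram: one '#' per 10 simulated futures that finished in that many weeks.
histogram = Counter(outcomes)
for weeks in sorted(histogram):
    print(f"{weeks:>2} weeks | {'#' * (histogram[weeks] // 10)}")
```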

TL;DR: this is getting nerdy, so please simplify

We use ranges (not absolutes) as inputs in the amount of work and the rate we do work. We run lots of different simulations to account for different outcomes (as we are using ranges).

So instead of this:

https://observablehq.com/embed/@nbrown/story-pointless?cells=viewof+work%2Cviewof+rate%2Cchart

We do this:

https://observablehq.com/embed/@nbrown/story-pointless?cells=chart2%2Cviewof+numberOfResultsToShow%2Cviewof+paceRange%2Cviewof+workRange

However, this is not easy on the eye! 

So what we then do is visualise the results on a Histogram, showing the distribution of the different outcomes.

We can then attribute percentiles (aka a probability of that outcome occurring) to the information. This allows us to present a range of outcomes and probability of those outcomes occurring, otherwise known as a forecast.
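Reading percentiles off a set of simulated outcomes can be sketched as follows. The list of results is invented for illustration, and I have used the simple nearest-rank definition of a percentile; tools may calculate it slightly differently:

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division, integers only
    return ordered[rank - 1]


# Hypothetical results: weeks to finish in 20 simulated futures.
simulated_weeks = [8, 9, 9, 10, 10, 10, 10, 11, 11, 11,
                   11, 11, 12, 12, 12, 13, 13, 14, 15, 17]

for pct in (50, 85, 95):
    print(f"{pct}% chance of finishing within {percentile(simulated_weeks, pct)} weeks")
# 50% chance of finishing within 11 weeks
# 85% chance of finishing within 13 weeks
# 95% chance of finishing within 15 weeks
```

Notice how asking for less risk moves the answer further out, which is the trade-off to put in front of stakeholders.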

Meaning we can then move to conversations like this:

The exact same approach can be applied if we had a deadline we were working towards and we wanted to know “what will we get?” or “how far down the backlog will we get to”. The input to the forecast becomes the number of weeks you have, with the distribution showing the percentage likelihood against the number of items to be completed.
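For the fixed-date version of the question, a sketch might look like this. The history, the six-week deadline and the function names are all assumptions for illustration. Note that the percentile is read from the low end here, because "85% likely" means 85% of simulated futures delivered at least that many items:

```python
import random


def simulate_items_completed(throughput_history, weeks, trials=1000, seed=7):
    """Total items finished by the deadline in each simulated future, sorted ascending."""
    rng = random.Random(seed)
    return sorted(
        sum(rng.choice(throughput_history) for _ in range(weeks))
        for _ in range(trials)
    )


def items_at_confidence(sorted_totals, pct):
    """Items we can expect 'at least' with pct% likelihood (pct from 1 to 100)."""
    index = (100 - pct) * len(sorted_totals) // 100
    return sorted_totals[index]


history = [3, 5, 2, 6, 4, 4, 7]  # hypothetical items completed per week
totals = simulate_items_completed(history, weeks=6)

for pct in (50, 85, 95):
    print(f"{pct}% chance of completing at least {items_at_confidence(totals, pct)} items")
```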

Tools to use

Clearly these simulations need computer input to help them be executed. Fortunately there are a number of tools out there to help:

  • Throughput Forecaster — a free and simple-to-use Excel/Google Sheets solution from Troy Magennis that will do 500 simulations based on manual entry of data into a few fields. Probably the easiest and quickest way to get started, just make sure you have your Throughput and Backlog Size data.

  • Actionable Agile — a paid tool for flow metrics and forecasting that works as a standalone SaaS solution or integrated within Jira or Azure DevOps. This tool can do up to 1 million simulations, plus gives a nice visual calendar date for the forecasts and percentage likelihood.

Source:

Actionable Agile Demo

  • FlowViz — a free Power BI template that I created for teams using Azure DevOps and GitHub Issues that generates flow metrics as well as monte carlo simulations. The histogram visual provides a legend which can be matched against a percentage likelihood.

Summary — multiple item forecasting

  • A forecast is a calculation about the future that includes both a range and a probability of that range occurring

  • Typically, we forecast using single values/averages — which is highly risky (odds of a coin toss at best)

  • Forecasting in the complex domain (Cynefin) needs to account for uncertainty (which using ‘average’ does not)

  • Any forecasts therefore need to be probabilistic (a range of possibilities) not deterministic (a single possibility)

  • Probabilistic Forecasting means running Monte Carlo Simulations (MCS) — simulating the future lots of different times

  • To do Monte Carlo simulation, we need Throughput data (number of completed items) and either a total number of items (backlog size) or a date we’re working towards

  • We should always continuously forecast as we get new information/learning, rather than forecasting just once

Ok but what about…

I’m sure you have lots of questions, as did I when first uncovering these approaches. To help you out I’ve collated the most frequently asked questions I get, which you can check out in part three.

— — — — — — — — — — — — — — — — — — — — — — — — — —


Story Pointless (Part 1 of 3)

The first in a three-part series on moving away from Story Points and how to introduce empirical methods within your team(s).

Part one refamiliarises ourselves with what story points are, a brief history lesson and facts about them, the pitfalls of using them and how we can use alternative methods for single item estimation.

What are story points?

Story points are a unit of measure for expressing an estimate of the overall effort (or some may say, complexity) that will be required to fully implement a product backlog item (PBI), user story or any other piece of work.

When we estimate with story points, we assign a point value to each item. Typically, teams will use a Fibonacci or Fibonacci-esque scale of 1, 2, 3, 5, 8, 13, 21, etc. Teams will often roll these points up as a means of measuring velocity (the sum of points for items completed that iteration) and/or planning using capacity (the number of points we can fit in an iteration).

Why do we use them?

There are many reasons why story points seem like a good idea:

  • The relative approach takes away the ‘date commitment’ aspect

  • It is quicker (and cheaper) than traditional estimation

  • It encourages collaboration and cross-functional behaviour

  • You cannot use them to compare teams — thus you should be unable to use ‘velocity’ as a weapon

A brief history lesson

Some things you might not know about story points:

Ron’s current thoughts on the topic

  • Story points are not (and never have been) mentioned in the Scrum Guide or viewed as mandatory as a part of Scrum

  • Story points originated from eXtreme Programming (XP)

      • The Chrysler Comprehensive Compensation (C3) project was the birth of XP

      • They originally estimated in “ideal days” and later, unitless Story Points

      • Ron Jeffries is credited with being the person who introduced them

  • James Grenning invented Planning Poker which was first publicised in Mike Cohn’s book Agile Estimating and Planning

  • Mountain Goat Software (Mike Cohn) own the trademark on planning poker cards and the copyright on the number sequence used for story point estimation

Problems with story points

What time would you tell your friends you’d meet them?

They do not speak in the language of our customer

Telling our customers and stakeholders something is a “2” or a “3” does not help when it comes to new ways of working. What if we did this in other industries — what would you think as a customer? Would you be happy?

They may encourage the right behaviours, but also the wrong ones too

Agile is all about collaboration, iterative execution, customer value, and experimentation. Teams can have ‘high velocity’ but be finishing everything on the last day of the sprint (not working at a sustainable pace/mini waterfalls) and/or be delivering the wrong things (build the wrong thing). Similarly, teams are pressured to ‘increase velocity’ which is easy to artificially inflate by making every 2 into a 3, 3 into a 5, etc. — then we have increased our velocity!

They are hugely inconsistent within a team

Plot the actual time from starting to finishing an item (in days) against the story point estimate. Compare the variance for stories that had the same points estimate:

For this team (in Nationwide) we can see:

  • 1 point story — 1–59 days

  • 2 point story — 1–128 days

  • 3 point story — 1–442 days

  • 5 point story — 2–98 days

  • 8 point story — 1–93 days
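If you want to run the same comparison on your own team's data, a sketch like the one below groups finished items by their points estimate and reports the spread in actual days. The sample data is invented and much smaller than a real export would be:

```python
from collections import defaultdict

# Hypothetical (story_points, cycle_time_in_days) pairs for finished items.
completed_items = [(1, 1), (1, 14), (2, 3), (2, 40), (3, 2), (3, 95), (5, 6), (5, 30)]


def cycle_time_range_by_points(items):
    """Shortest and longest cycle time seen for each story point value."""
    grouped = defaultdict(list)
    for points, days in items:
        grouped[points].append(days)
    return {points: (min(days), max(days)) for points, days in sorted(grouped.items())}


for points, (shortest, longest) in cycle_time_range_by_points(completed_items).items():
    print(f"{points} point story: {shortest} to {longest} days")
```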

They are a poor mechanism for planning / full of assumptions

Not only is velocity a highly volatile metric but it also encourages playing ‘Tetris’ with people in complex work. When estimating stories, teams purely take the story and acceptance criteria as written. They do not account for various assumptions (customer availability, platform reliability) and/or things that can go wrong or distract them (what is our WIP, discovery, refinement, production issues, bug-fixes, etc.) during an iteration.

Uncovering better ways

Agile has always been about “uncovering better ways”, after all it’s the first line of the Manifesto!

Given the limitations with story points, we should be open to exploring alternative approaches. When looking at uncovering new approaches, we need to be able to:

  • Forecast/Estimate a single item (PBI/User Story)

  • Forecast/Estimate our capacity at a sprint level (Sprint Backlog)

  • Forecast/Estimate our capacity at a release level (Release Backlog)

Source: Jon Smart — Sooner, Safer, Happier

Estimating when something will be done is particularly tricky in the world of software development. Our work predominantly sits in the domain of ‘Complex’ (using Cynefin) where there are “unknown unknowns”. Therefore, when someone asks, “when will it be done?” or “what will we get?” — when we estimate, we cannot give them a single date/number, as there are many factors to consider. As a result, you need to approach the question as one which is probabilistic (a range of possibilities) rather than deterministic (a single possibility).

Forecasts are about predicting the future, but we all know the future is uncertain. Uncertainty manifests itself as a multitude of possible outcomes for a given future event, which is what science calls probability.

To think probabilistically means to acknowledge that there is more than one possible future outcome which, in our context, means using ranges, not absolutes.

Single item forecast/estimation

One of the two key flow metrics that inputs into single item estimation is our Cycle Time. Cycle time is the amount of elapsed time between when a work item started and when a work item finished. We visualise this on a scatter plot, like so:

On the scatter plot, each ‘dot’ represents a PBI/user story, plotted against the completion date and the time (in days) it took to complete. Our 85th percentile (highlighted in the visual) tells us that 85% of our stories are completed within n days or less. Therefore with this team, we can say that 85% of the time we finish stories in 26 days or less.

We can communicate this to customers and stakeholders by saying that:

“If we start work on this today, there is an 85% chance it will be done in 26 days or less”

This may be sufficient for your customer (if so — great!), however they may push for it sooner. If, for instance, with this team they wanted the story in 7 days, you can show them (with data) that this is only 50% likely. Use this as a basis to start the conversation with them (and the rest of the team!) around breaking work down.
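Both numbers in that conversation come straight from your historical cycle times. Here is a minimal sketch using invented data, deliberately chosen so the results echo the 26-day and 50% figures above. The percentile uses the simple nearest-rank definition, which may differ slightly from what your tooling does:

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division, integers only
    return ordered[rank - 1]


def chance_within(cycle_times, days):
    """Share of finished items that took `days` or fewer."""
    return sum(1 for ct in cycle_times if ct <= days) / len(cycle_times)


# Hypothetical cycle times (in days) for 20 finished stories.
cycle_times = [1, 2, 3, 3, 4, 5, 5, 6, 7, 7,
               9, 10, 12, 14, 17, 21, 26, 33, 41, 58]

print(f"85% of items finished in {percentile(cycle_times, 85)} days or less")       # 26
print(f"Chance of finishing within 7 days: {chance_within(cycle_times, 7):.0%}")    # 50%
```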

What about when work commences?

If they are happy with the forecast, and we start work on an item, it’s important that we don’t stop there and ensure we continue to manage the expectations of the customer.

Work Item Age is the second metric to use to maintain a continued focus on flow. This is the amount of time (in days) between when an item started and the current time. This applies only to items that are still in progress.

Each dot represents a user story and the age (in days) of that respective PBI/user story so far.

Use this in the Daily Scrum to track the age of an item against your 85th percentile time, as well as comparing to where an item is in your process.

If it is in danger of ‘breaching’ the cycle time, swarm on an item or break it down accordingly. If this can’t be done, work with your stakeholder(s) to collaborate on how to achieve the best outcome.
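A rough sketch of that daily check is below. The 70% warning threshold is purely my assumption for illustration (pick whatever early-warning point suits your team), and the items and dates are made up:

```python
from datetime import date


def work_item_age(started: date, today: date) -> int:
    """Days elapsed since the item was started (in-progress items only)."""
    return (today - started).days


def items_at_risk(in_progress, today, percentile_85_days, warn_ratio=0.7):
    """Items whose age has reached warn_ratio of the 85th percentile, oldest first.

    warn_ratio is an assumed early-warning threshold, not a standard value.
    """
    flagged = []
    for name, started in in_progress.items():
        age = work_item_age(started, today)
        if age >= warn_ratio * percentile_85_days:
            flagged.append((name, age))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical in-progress stories and their start dates.
in_progress = {
    "Story A": date(2021, 4, 26),
    "Story B": date(2021, 5, 17),
    "Story C": date(2021, 5, 1),
}

print(items_at_risk(in_progress, today=date(2021, 5, 20), percentile_85_days=26))
# [('Story A', 24), ('Story C', 19)]
```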

As a Scrum Master / Agile Delivery Manager / Coach, your role would be to guide the team in understanding the trade offs of high WIP age items vs. those closest to done vs. starting something new — no easy task!

Summary — Single Item Forecasting

In terms of a story pointless approach to estimating a single item, try the following:

  1. Prioritise your backlog

  2. Use your Cycle Time scatter plot and 85th percentile

  3. Take the next highest priority item on your backlog

  4. As a team, ask — “Do we think this can be delivered within our 85th percentile?” (Note: you can probe further and ask “can this be delivered within our 50th percentile?” to promote further slicing/refinement)

  5. If yes, then let’s get started/move it to ‘Ready’ (considering your work-in-progress)

  6. If no, then find out why/break it down till it is small enough

  7. Once we start work on items, use Work Item Age as a leading indicator for flow

  8. Manage Work Item Age as part of your Daily Scrum; if it looks like it may exceed the 85th percentile — swarm/slice!

Please note: it’s best to familiarise yourself with what your 85th percentile is first (particularly in comparison to your cadence). 

If it’s 100+ days then you should be focusing initially on reducing that time — this can be done through various means such as pairing, mobbing, story mapping, story slicing, lowering WIP, etc.

But what about for multiple items? And what about…

For multiple item forecasting, be sure to check out part two.

If you have any questions, feel free to add them to the comments below in time for part three, which will cover common questions/observations people make about these new methods…
