Is Your Confidence Scale Actually Useful?

A good feedback loop checks not only whether confidence was accurate, but whether higher confidence actually picked easier cases.

Introduction

A feedback loop only improves judgement if it measures more than whether predictions were right or wrong. It should also reveal whether your confidence tracked the strength of the evidence. The Brier score is one of the best-known tools for evaluating probabilistic forecasts because it rewards assigning realistic probabilities rather than making overconfident guesses. However, a good Brier score does not automatically mean your confidence scale is genuinely informative. Two forecasters can appear equally well calibrated while differing dramatically in their ability to distinguish easy cases from difficult ones. Understanding this distinction is essential if you want feedback that improves judgement rather than merely confirming that your average confidence was reasonable. [Wikipedia]

What a forecast score can and cannot tell you

The Brier score measures the average squared difference between the probability you assigned and what actually happened. A forecast of 80% that succeeds receives a small penalty because it was close to reality, while an 80% forecast that fails receives a much larger one. Lower scores are better, and the score is a strictly proper scoring rule, meaning that, in expectation, it encourages honest probability estimates rather than strategic exaggeration. [Wikipedia]
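As a minimal sketch of that definition, the snippet below scores the 80% forecast from the paragraph above both ways. The function name `brier_score` and the variable names are illustrative choices, not taken from any particular library.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally long, non-empty sequences")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# An 80% forecast that comes true versus one that does not.
penalty_if_true = brier_score([0.8], [1])   # (0.8 - 1)^2, about 0.04
penalty_if_false = brier_score([0.8], [0])  # (0.8 - 0)^2, about 0.64
```

The failed forecast is penalised sixteen times as heavily as the successful one, which is why the score punishes confident misses so strongly.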

This makes the Brier score extremely valuable for learning from repeated predictions. If you consistently claim 90% confidence when only about 60% of those predictions come true, your Brier score will expose that overconfidence.
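The overconfidence case can be sketched with a small, invented record: ten predictions stated at 90%, of which six came true. The helper names `calibration_table` and `brier_score` are hypothetical, and the data exist only to illustrate the point.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes):
    """Map each stated probability to (number of forecasts, observed hit rate)."""
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)
    return {p: (len(hits), sum(hits) / len(hits)) for p, hits in sorted(groups.items())}


# Invented record: ten predictions made at 90%, only six of which came true.
outcomes = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
table = calibration_table([0.9] * 10, outcomes)   # {0.9: (10, 0.6)}

overconfident = brier_score([0.9] * 10, outcomes)  # about 0.33
honest = brier_score([0.6] * 10, outcomes)         # about 0.24
```

On the same outcomes, stating 60% would have scored about 0.24 instead of 0.33, so the gap between stated and observed frequency shows up directly as a worse score.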

However, the score is an overall measure of forecast quality, not a diagnosis. By itself, it cannot explain why one forecaster performed better than another. Similar overall scores can arise from very different combinations of strengths and weaknesses. One person may be well calibrated but poor at recognising strong evidence; another may separate easy from difficult problems well but express probabilities that are systematically too high or too low. Looking only at the overall score hides these differences. [Royal Meteorological Society]

Calibration versus discrimination in judgement

A common misunderstanding is to treat calibration as the whole story. Calibration asks:

When I say “70% likely”, does that event occur about seven times out of ten?

Discrimination (also called resolution in forecast verification) asks a different question:

Do I assign higher probabilities to the events that turn out to happen than to the ones that do not?

These are related but distinct abilities.

Imagine two forecasters evaluating 100 events.

  • Forecaster A predicts every event at 60%.
  • Forecaster B predicts some events at 95%, others at 10%, and reserves intermediate probabilities for ambiguous cases.

If both achieve good calibration overall, Forecaster B is still more useful because their confidence varies meaningfully with the available evidence. They distinguish situations where the evidence is genuinely compelling from situations where it is weak.
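To make the comparison concrete, here is a sketch with invented numbers. The article does not specify how Forecaster B's forecasts are distributed, so the group sizes and hit counts below are assumptions, chosen so that both forecasters are perfectly calibrated on the same 100 events.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Assumed outcomes for 100 events, listed in three groups (60 events happen in total):
#   40 events B rates at 95%  -> 38 happen
#   20 events B rates at 10%  ->  2 happen
#   40 events B rates at 50%  -> 20 happen
outcomes = [1] * 38 + [0] * 2 + [1] * 2 + [0] * 18 + [1] * 20 + [0] * 20

forecaster_a = [0.60] * 100                          # same answer for every event
forecaster_b = [0.95] * 40 + [0.10] * 20 + [0.50] * 40

score_a = brier_score(forecaster_a, outcomes)        # about 0.240
score_b = brier_score(forecaster_b, outcomes)        # about 0.137
```

Both forecasters' stated probabilities match the observed frequencies in every group, yet B scores markedly better because B's confidence separates the likely events from the unlikely ones.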

Forecast researchers therefore emphasise that useful probability judgements require both:

  • Calibration (reliability): stated confidence matches observed frequencies.
  • Resolution (discrimination): confidence changes appropriately across cases instead of clustering around the average probability.

Murphy’s classic decomposition of the Brier score separates these components mathematically into reliability, resolution and uncertainty, making it possible to identify whether poor performance comes from miscalibration or from failing to distinguish informative from uninformative situations. [Royal Meteorological Society]
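For reference, the decomposition is usually written as follows. The notation is the conventional one rather than something defined in this article: the N forecasts are grouped into K distinct stated probabilities f_k, with n_k forecasts in group k, observed frequency ō_k within that group, and overall base rate ō.

```latex
\mathrm{BS}
  = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(f_k - \bar{o}_k)^2}_{\text{reliability}}
  \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
  \;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```

A lower reliability term means better calibration, a higher resolution term means better discrimination, and the uncertainty term depends only on the events, not on the forecaster.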

How event difficulty can mislead feedback

One reason calibration alone can be misleading is that some datasets simply contain few genuinely predictable events.

Suppose someone predicts hundreds of coin tosses at exactly 50%. Their calibration is essentially perfect because about half of the tosses land heads. Yet the forecasts provide no useful information whatsoever. They never identify an “easy” case because none exists.

Now compare this with predicting medical diagnoses, weather, or election outcomes. Some cases contain abundant evidence, while others remain genuinely uncertain. A useful forecaster should become more confident when evidence accumulates and less confident when it does not.

If every forecast remains close to the base rate—perhaps 45–55%—calibration may still appear excellent even though the forecaster has learned almost nothing from the evidence available in individual cases.

Murphy’s decomposition captures this through the resolution term. High resolution means forecasts move away from the overall base rate when evidence justifies it. Low resolution means forecasts stay close to the average regardless of what is known. [Royal Meteorological Society]
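A rough sketch of how the three terms could be computed is shown below. It groups forecasts by their exact stated value, which is a simplifying assumption (real analyses often bin nearby probabilities), and both example records are invented.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by their exact stated value, so that
    Brier score = reliability - resolution + uncertainty.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)
    reliability = resolution = 0.0
    for p, hits in groups.items():
        observed = sum(hits) / len(hits)
        reliability += len(hits) * (p - observed) ** 2
        resolution += len(hits) * (observed - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)


# Invented coin-toss record: 200 forecasts of 50%, 100 heads.
coin = murphy_decomposition([0.5] * 200, [1] * 100 + [0] * 100)
# coin == (0.0, 0.0, 0.25): perfectly reliable, but zero resolution.

# Invented record with real signal: 90% forecasts hit 45 of 50, 10% forecasts hit 5 of 50.
sharp = murphy_decomposition(
    [0.9] * 50 + [0.1] * 50,
    [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45,
)
# sharp is roughly (0.0, 0.16, 0.25): equally reliable, much higher resolution.
```

Both records are equally well calibrated and face the same uncertainty; only the resolution term distinguishes the forecaster who learned something from the evidence.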

This is an important lesson for improving judgement. Avoiding overconfidence is valuable, but permanently avoiding strong confidence can become a different kind of mistake if you fail to recognise situations where the evidence genuinely warrants it.

Why some tasks naturally produce worse scores

The Brier score also depends partly on the inherent uncertainty of the problem itself. [Wikipedia]

Forecasting whether it will rain tomorrow in a region with highly variable weather is fundamentally harder than forecasting whether the sun will rise tomorrow. Even a perfect forecaster cannot eliminate uncertainty that genuinely exists in the world.

Murphy’s decomposition therefore includes an uncertainty component reflecting the variability of the underlying events themselves. High uncertainty does not necessarily indicate poor judgement; it may simply describe a difficult forecasting environment. [Royal Meteorological Society]
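For a yes/no event, the uncertainty term is simply the base rate multiplied by its complement. The sketch below uses two made-up base rates to show how much the term can differ between environments; the function name is illustrative.

```python
def uncertainty(base_rate):
    """Uncertainty term of Murphy's decomposition for a binary event."""
    return base_rate * (1 - base_rate)


# Hypothetical base rates, chosen only for illustration.
variable_weather = uncertainty(0.5)      # 0.25, the largest value the term can take
near_certain_event = uncertainty(0.999)  # about 0.001
```

A forecaster who always states the base rate scores exactly this value, so it is a natural yardstick: beating 0.25 on coin-flip-like questions is far harder than beating 0.001 on near-certain ones.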

This matters when comparing performance across domains.

  • Financial markets may have inherently noisier signals than controlled laboratory tasks.
  • Long-term geopolitical forecasts are generally harder than short-term weather predictions.
  • Rare events often require much larger datasets before Brier scores become statistically reliable.

Researchers in forecast verification have shown that rare events can make Brier scores unstable unless many observations are available, making comparisons between forecasters more difficult than they first appear. [Wikipedia]
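One way to see the instability is to ask how noisy the average score is for a given number of events. The sketch below makes two simplifying assumptions that are not from the source: events are independent, and the forecaster always states the true base rate. Under those assumptions the spread of the mean Brier score has a closed form.

```python
import math


def relative_standard_error(base_rate, n_events):
    """Standard error of the mean Brier score divided by its expected value,
    for a forecaster who always states the true base rate (an assumption)."""
    p = base_rate
    expected = p * (1 - p)  # expected squared error per event
    # The squared error is (1 - p)^2 with probability p and p^2 otherwise,
    # so its variance is p * (1 - p) * (1 - 2p)^2.
    variance = p * (1 - p) * (1 - 2 * p) ** 2
    return math.sqrt(variance / n_events) / expected


rare = relative_standard_error(0.01, 100)    # about 0.98
common = relative_standard_error(0.30, 100)  # about 0.09
```

With 100 events, the score for a 1% event is uncertain by roughly its own size, while the score for a 30% event is uncertain by under a tenth, so the same sample supports much weaker conclusions about rare events.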

Practical limits of using Brier scores for personal improvement

For developing analytical skill, the Brier score is best viewed as one instrument rather than the entire dashboard.

Used alone, it cannot reveal:

  • whether confidence varies appropriately with evidence;
  • whether probabilities are concentrated too narrowly around the middle;
  • whether poor performance reflects bad judgement or simply unusually difficult questions;
  • whether improvement comes from better calibration or better discrimination.

This is why forecasting researchers increasingly recommend combining several diagnostics. Reliability diagrams reveal calibration visually, while measures such as receiver operating characteristic (ROC) curves assess discrimination separately. More recent work argues that examining calibration, discrimination and overall scoring together provides a much richer picture of judgement quality than any single metric alone. [PMC]
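As an illustration of a discrimination-only measure, the area under the ROC curve can be computed by comparing every event that happened with every event that did not. This is a plain pairwise sketch with invented data and a hypothetical function name; statistical libraries offer faster equivalents.

```python
def roc_auc(forecasts, outcomes):
    """Probability that an event that happened received a higher forecast
    than an event that did not (ties count as half)."""
    positives = [p for p, o in zip(forecasts, outcomes) if o == 1]
    negatives = [p for p, o in zip(forecasts, outcomes) if o == 0]
    if not positives or not negatives:
        raise ValueError("need at least one event of each kind")
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


outcomes = [1, 1, 0, 0]
flat = roc_auc([0.6, 0.6, 0.6, 0.6], outcomes)            # 0.5: no discrimination
sharp = roc_auc([0.9, 0.8, 0.3, 0.1], outcomes)           # 1.0: perfect ranking
miscalibrated = roc_auc([0.99, 0.98, 0.97, 0.96], outcomes)  # also 1.0
```

The last line shows why this measure complements rather than replaces calibration: the ranking is perfect even though every stated probability is far too high.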

Using feedback that improves judgement rather than just scores

For personal forecasting practice, the most useful lesson is straightforward: do not celebrate a good average score without asking whether your confidence scale is actually distinguishing stronger evidence from weaker evidence.

After reviewing a set of predictions, ask questions such as:

  • Did high-confidence predictions genuinely correspond to easier cases?
  • Were low-confidence predictions concentrated where the evidence was genuinely ambiguous?
  • Did different confidence levels correspond to meaningfully different success rates?
  • Am I learning to recognise stronger evidence, or merely becoming more cautious?

A feedback loop that answers only “Were my probabilities calibrated?” is incomplete. A better loop also asks whether your confidence carried useful information about the world. Improving analytical judgement requires both: confidence that is well calibrated and confidence that becomes stronger only when the evidence genuinely deserves it.

Further Reading

Books related to this article, for readers who want to go deeper.

Superforecasting

By Philip Eyrikson Tetlock, Dan Gardner

Explains forecasting accuracy, calibration, scoring rules, and feedback loops including Brier scores.

Noise

By Daniel Kahneman, Olivier Sibony et al.

Helps readers understand variability in judgment beyond simple calibration metrics.

Endnotes

  1. Source: Wikipedia
    Title: Brier score
    Link: https://en.wikipedia.org/wiki/Brier_score

  2. Source: pmc.ncbi.nlm.nih.gov
    Title: Stable reliability diagrams for probabilistic classifiers
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7923594/
    Source snippet

    by T Dimitriadis · 2021: We introduce the CORP approach, which generates provably statistically consistent, optimally...

  3. Source: Wikipedia
    Title: Murphy
    Link: https://en.wikipedia.org/wiki/Murphy
    Source snippet

    Murphy is a surname of Irish origin meaning 'sea warrior'. Murphy. Pronunciation, /ˈmɜːrfi/. Language, English. Origin. Language...

  4. Source: rmets.onlinelibrary.wiley.com
    Title: Simplifying and generalising Murphy's Brier score...
    Link: https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.2985
    Source snippet

    by S Siegert · 2017: The decomposition of...

  5. Source: rmets.onlinelibrary.wiley.com
    Title: Simplifying and generalising Murphy's Brier score...
    Link: https://rmets.onlinelibrary.wiley.com/doi/10.1002/qj.2985
    Source snippet

    15 Dec 2016: The decomposition of the Brier score into...

  6. Source: rmets.onlinelibrary.wiley.com
    Title: A conditional decomposition of proper scores: quantifying the...
    Link: https://rmets.onlinelibrary.wiley.com/doi/10.1002/qj.4478
    Source snippet

    24 Apr 2023: The classical decomposition of...

  7. Source: insightful-data-lab.com
    Title: Murphy’s Decomposition
    Link: https://insightful-data-lab.com/2025/08/21/murphys-decomposition/
    Source snippet

    21 Aug 2025: Murphy's decomposition = a way to break down forecast error into calibration (reliability), sha...

  8. Source: metricgate.com
    Title: Brier score decomposition
    Link: https://metricgate.com/docs/brier-score-decomposition/
    Source snippet

    Dec 7, 2024: A high Brier score might stem from poor calibration (high reliability), poor discrimination (low resolution), or...

  9. Source: emergentmind.com
    Title: Brier Score: Calibration, Resolution, and Uncertainty
    Link: https://www.emergentmind.com/topics/brier-score-term
    Source snippet

    25 Jul 2025: The Brier score term evaluates probabilistic forecasts by decomposing...
