Is Your Confidence Scale Actually Useful?

Introduction

A feedback loop only improves judgement if it measures more than whether predictions were right or wrong. It should also reveal whether your confidence tracked the strength of the evidence. The Brier score is one of the best-known tools for evaluating probabilistic forecasts because it rewards assigning realistic probabilities rather than making overconfident guesses. However, a good Brier score does not automatically mean your confidence scale is genuinely informative. Two forecasters can appear equally well calibrated while differing dramatically in their ability to distinguish easy cases from difficult ones. Understanding this distinction is essential if you want feedback that improves judgement rather than merely confirming that your average confidence was reasonable. [Wikipedia: Brier score]

What a forecast score can and cannot tell you

The Brier score measures the average squared difference between the probability you assigned and what actually happened. A forecast of 80% that succeeds receives a small penalty because it was close to reality, while an 80% forecast that fails receives a much larger one. Lower scores are better, and the score is considered a strictly proper scoring rule, meaning that, in expectation, it encourages honest probability estimates rather than strategic exaggeration. [Wikipedia: Brier score]
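As a minimal sketch of the definition above, the score for yes/no events can be computed in a few lines. The function name `brier_score` is illustrative and not taken from any particular library, and outcomes are assumed to be coded as 1 (happened) or 0 (did not happen).

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two equally long, non-empty sequences")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# An 80% forecast that comes true is penalised lightly: (0.8 - 1)^2, about 0.04.
penalty_hit = brier_score([0.8], [1])
# The same forecast that fails is penalised heavily: (0.8 - 0)^2, about 0.64.
penalty_miss = brier_score([0.8], [0])

print(round(penalty_hit, 2), round(penalty_miss, 2))
```

The squaring is what makes a confident miss cost far more than a confident hit saves.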

This makes the Brier score extremely valuable for learning from repeated predictions. If you consistently claim 90% confidence when only about 60% of those predictions come true, your Brier score will expose that overconfidence.
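To make the overconfidence example concrete, here is a small sketch with made-up numbers: 100 predictions, all stated at 90%, of which 60 came true. The helper name `constant_forecast_score` is hypothetical.

```python
# Hypothetical record: 100 predictions, 60 of which came true.
outcomes = [1] * 60 + [0] * 40

def constant_forecast_score(prob, outcomes):
    """Brier score when the same probability is stated for every event."""
    return sum((prob - o) ** 2 for o in outcomes) / len(outcomes)

overconfident = constant_forecast_score(0.9, outcomes)  # claims 90%, reality is 60%
honest = constant_forecast_score(0.6, outcomes)         # claims what actually happens

print(round(overconfident, 3), round(honest, 3))
```

Under these assumptions the overconfident record scores about 0.33 against about 0.24 for the honest one, so the gap between claimed and observed frequency shows up directly as a worse (higher) score.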

However, the score is an overall measure of forecast quality, not a diagnosis. By itself, it cannot explain why one forecaster performed better than another. Similar overall scores can arise from very different combinations of strengths and weaknesses. One person may be well calibrated but poor at recognising strong evidence; another may separate easy from difficult problems well but express probabilities that are systematically too high or too low. Looking only at the overall score hides these differences. [Royal Meteorological Society: Siegert 2017]

Calibration versus discrimination in judgement

A common misunderstanding is to treat calibration as the whole story. Calibration asks:

When I say “70% likely”, does that event occur about seven times out of ten?

Discrimination (also called resolution in forecast verification) asks a different question:

Do I assign higher probabilities to the events that are genuinely more likely, and lower probabilities to the ones that are not?

These are related but distinct abilities.

Imagine two forecasters evaluating 100 events.

Forecaster A predicts every event at 60%.
Forecaster B predicts some events at 95%, others at 10%, and reserves intermediate probabilities for ambiguous cases.

If both achieve good calibration overall, Forecaster B is still more useful because their confidence varies meaningfully with the available evidence. They distinguish situations where the evidence is genuinely compelling from situations where it is weak.
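The comparison can be sketched with invented numbers. The counts below are assumptions chosen so that both forecasters are perfectly calibrated on the same 100 events (60 of which occur), and 50% stands in for Forecaster B's "intermediate" probability.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical data for Forecaster B.
# Each tuple: (stated probability, number of events, number that occurred).
forecaster_b_bins = [(0.95, 40, 38), (0.10, 20, 2), (0.50, 40, 20)]

b_forecasts, outcomes = [], []
for prob, n_events, n_occurred in forecaster_b_bins:
    b_forecasts += [prob] * n_events
    outcomes += [1] * n_occurred + [0] * (n_events - n_occurred)

# Forecaster A says 60% for every one of the same events.
a_forecasts = [0.60] * len(outcomes)

score_a = brier_score(a_forecasts, outcomes)
score_b = brier_score(b_forecasts, outcomes)
print(round(score_a, 3), round(score_b, 3))
```

With these figures Forecaster A scores about 0.24 and Forecaster B about 0.137. Both are calibrated, so the whole difference comes from B's confidence varying with the cases.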

Forecast researchers therefore emphasise that useful probability judgements require both:

Calibration (reliability): stated confidence matches observed frequencies.
Resolution (discrimination): confidence changes appropriately across cases instead of clustering around the average probability.

Murphy’s classic decomposition of the Brier score separates these components mathematically into reliability, resolution and uncertainty, making it possible to identify whether poor performance comes from miscalibration or from failing to distinguish informative from uninformative situations. [Royal Meteorological Society: Siegert 2017]
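For reference, the decomposition is commonly written as follows when forecasts take K distinct probability values. The notation is a conventional choice made for this sketch rather than something fixed by the article: N is the number of forecasts, n_k the number of forecasts issued at probability f_k, \bar{o}_k the observed frequency of the event among those forecasts, and \bar{o} the overall base rate.

```latex
\[
\mathrm{BS}
= \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(f_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
\]
```

Reliability is a penalty (smaller is better), resolution is subtracted (larger is better), and uncertainty depends only on the events, not on the forecaster.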

How event difficulty can mislead feedback

One reason calibration alone can be misleading is that some datasets simply contain few genuinely predictable events.

Suppose someone predicts hundreds of coin tosses at exactly 50%. Their calibration is essentially perfect because about half of the tosses land heads. Yet the forecasts provide no useful information whatsoever. They never identify an “easy” case because none exists.

Now compare this with predicting medical diagnoses, weather, or election outcomes. Some cases contain abundant evidence, while others remain genuinely uncertain. A useful forecaster should become more confident when evidence accumulates and less confident when it does not.

If every forecast remains close to the base rate—perhaps 45–55%—calibration may still appear excellent even though the forecaster has learned almost nothing from the evidence available in individual cases.

Murphy’s decomposition captures this through the resolution term. High resolution means forecasts move away from the overall base rate when evidence justifies it. Low resolution means forecasts stay close to the average regardless of what is known. [Royal Meteorological Society: Siegert 2017]
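A rough sketch of how the three terms could be computed is shown below. It groups forecasts by their exact stated value, which only works when a forecaster uses a handful of distinct probabilities; real data would usually need binning. The function name and both example forecasters ("hedger" and "sorter") are invented for illustration.

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by stated probability."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Hypothetical outcomes: 100 events with a 50% base rate.
outcomes = [1] * 50 + [0] * 50

# The hedger always says 50%. The sorter says 90% or 10% and is right 9 times in 10.
hedger = [0.5] * 100
sorter = [0.9] * 45 + [0.1] * 5 + [0.1] * 45 + [0.9] * 5

for name, forecasts in [("hedger", hedger), ("sorter", sorter)]:
    rel, res, unc = murphy_decomposition(forecasts, outcomes)
    print(name, round(rel, 3), round(res, 3), round(unc, 3),
          "Brier:", round(rel - res + unc, 3))
```

Both invented forecasters have a reliability term of zero, meaning they are equally well calibrated. Only the resolution term separates them: roughly 0 for the hedger and 0.16 for the sorter.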

This is an important lesson for improving judgement. Avoiding overconfidence is valuable, but permanently avoiding strong confidence can become a different kind of mistake if you fail to recognise situations where the evidence genuinely warrants it.

Why some tasks naturally produce worse scores

A Brier score also depends partly on the inherent uncertainty of the problem itself. [Wikipedia: Brier score]

Forecasting whether it will rain tomorrow in a region with highly variable weather is fundamentally harder than forecasting whether the sun will rise tomorrow. Even a perfect forecaster cannot eliminate uncertainty that genuinely exists in the world.

Murphy’s decomposition therefore includes an uncertainty component reflecting the variability of the underlying events themselves. High uncertainty does not necessarily indicate poor judgement; it may simply describe a difficult forecasting environment. [Royal Meteorological Society: Siegert 2017]
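For yes/no events the uncertainty component is the base rate multiplied by one minus the base rate, which is also the Brier score you would get by always forecasting the base rate. The short sketch below evaluates it for three base rates that are assumptions chosen purely for illustration.

```python
def uncertainty(base_rate):
    """Uncertainty term of the Brier decomposition for a binary event."""
    return base_rate * (1 - base_rate)

# Hypothetical base rates for three kinds of question.
examples = [("near-certain event", 0.99), ("coin-like event", 0.50), ("rare event", 0.02)]

for label, base_rate in examples:
    print(label, round(uncertainty(base_rate), 4))
```

The term peaks at 0.25 when the base rate is 50% and shrinks towards zero as events become nearly certain or nearly impossible. A raw Brier score of 0.02 on rare events may therefore reflect no more skill than a score of 0.25 on coin-like ones.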

This matters when comparing performance across domains.

Financial markets may have inherently noisier signals than controlled laboratory tasks.
Long-term geopolitical forecasts are generally harder than short-term weather predictions.
Rare events often require much larger datasets before Brier scores become statistically reliable.

Researchers in forecast verification have shown that rare events can make Brier scores unstable unless many observations are available, making comparisons between forecasters more difficult than they first appear. [Wikipedia: Brier score]

Practical limits of using Brier scores for personal improvement

For developing analytical skill, the Brier score is best viewed as one instrument rather than the entire dashboard.

Used alone, it cannot reveal:

whether confidence varies appropriately with evidence;
whether probabilities are concentrated too narrowly around the middle;
whether poor performance reflects bad judgement or simply unusually difficult questions;
whether improvement comes from better calibration or better discrimination.

This is why forecasting researchers increasingly recommend combining several diagnostics. Reliability diagrams reveal calibration visually, while measures such as receiver operating characteristic (ROC) curves assess discrimination separately. More recent work argues that examining calibration, discrimination and overall scoring together provides a much richer picture of judgement quality than any single metric alone. [PMC: Dimitriadis 2021]
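As one possible way to look at discrimination on its own, the area under the ROC curve can be computed directly from its pairwise definition. The sketch below is a plain-Python illustration with invented data, not a substitute for a statistics library, and the function name `roc_auc` is an assumption.

```python
def roc_auc(forecasts, outcomes):
    """Chance that an event that occurred got a higher forecast than one that
    did not occur, counting ties as half. 0.5 means no discrimination."""
    positives = [f for f, o in zip(forecasts, outcomes) if o == 1]
    negatives = [f for f, o in zip(forecasts, outcomes) if o == 0]
    if not positives or not negatives:
        raise ValueError("need at least one event of each kind")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes and two sets of forecasts for the same eight events.
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
flat = [0.5] * 8                                   # confidence never varies
varied = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # confidence tracks the cases

auc_flat = roc_auc(flat, outcomes)
auc_varied = roc_auc(varied, outcomes)
print(auc_flat, auc_varied)
```

The flat forecasts sit at 0.5 because they never rank one case above another, while the varied forecasts rank 15 of the 16 possible pairs correctly. This measure ignores calibration entirely, which is why it complements a reliability diagram rather than replacing it.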

Using feedback that improves judgement rather than just scores

For personal forecasting practice, the most useful lesson is straightforward: do not celebrate a good average score without asking whether your confidence scale is actually distinguishing stronger evidence from weaker evidence.

After reviewing a set of predictions, ask questions such as:

Did high-confidence predictions genuinely correspond to easier cases?
Were low-confidence predictions concentrated where the evidence was genuinely ambiguous?
Did different confidence levels correspond to meaningfully different success rates?
Am I learning to recognise stronger evidence, or merely becoming more cautious?
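One simple way to start answering these questions is to tabulate how often predictions came true at each confidence level. The sketch below assumes a personal log stored as (stated probability, outcome) pairs; the log itself and the function name are made up for illustration.

```python
from collections import defaultdict

def hit_rate_by_confidence(forecasts, outcomes):
    """Map each stated probability to (number of forecasts, observed frequency)."""
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    return {f: (len(obs), sum(obs) / len(obs)) for f, obs in sorted(groups.items())}

# Hypothetical forecasting log: (stated probability, did it happen?).
log = [(0.9, 1), (0.9, 1), (0.9, 1), (0.9, 0),
       (0.7, 1), (0.7, 1), (0.7, 0), (0.7, 0),
       (0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]

forecasts, outcomes = zip(*log)
table = hit_rate_by_confidence(forecasts, outcomes)

for prob, (count, freq) in table.items():
    print(f"said {prob:.0%}: {count} forecasts, {freq:.0%} came true")
```

In this invented log the 70% and 50% predictions came true equally often, which would suggest those two confidence levels are not yet carrying different information. With only four predictions per level that could easily be noise, so a real review would need many more entries.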

A feedback loop that answers only “Were my probabilities calibrated?” is incomplete. A better loop also asks whether your confidence carried useful information about the world. Improving analytical judgement requires both: confidence that is well calibrated and confidence that becomes stronger only when the evidence genuinely deserves it.

Endnotes

Source: Wikipedia
Title: Brier score
Link: https://en.wikipedia.org/wiki/Brier_score

Source: pmc.ncbi.nlm.nih.gov
Title: Stable reliability diagrams for probabilistic classifiers (T Dimitriadis, 2021)
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7923594/

Source: Wikipedia
Title: Murphy
Link: https://en.wikipedia.org/wiki/Murphy

Source: rmets.onlinelibrary.wiley.com
Title: Simplifying and generalising Murphy's Brier score… (S Siegert, 2017)
Link: https://rmets.onlinelibrary.wiley.com/doi/10.1002/qj.2985

Source: rmets.onlinelibrary.wiley.com
Title: A conditional decomposition of proper scores: quantifying the… (2023)
Link: https://rmets.onlinelibrary.wiley.com/doi/10.1002/qj.4478

Source: insightful-data-lab.com
Title: Murphy’s Decomposition (2025)
Link: https://insightful-data-lab.com/2025/08/21/murphys-decomposition/

Source: metricgate.com
Title: Brier score decomposition (2024)
Link: https://metricgate.com/docs/brier-score-decomposition/

Source: emergentmind.com
Title: Brier Score: Calibration, Resolution, and Uncertainty (2025)
Link: https://www.emergentmind.com/topics/brier-score-term
