Lisa Wimmer

Andreas Kirsch 🇺🇦@BlackHC·11 Aug

I'm tagging a few people who might find this interesting and/or might have some lit pointers with better insights 🤞 @AdaptiveAgents @AndreyMalinin @kschweig_ @aichberger @HochreiterSepp @roydanroy @tom_rainforth @fbickfordsmith @WmLisa 15/15

Andreas Kirsch 🇺🇦@BlackHC·11 Aug

A small info-theory thread (or at least food for thought): Why is the Bayesian Model Average the best choice? Really why? I'll go through a naive argument (anyone has better references?), simple lower-bounds and decompositions, and pitch a "reverse mutual information" 1/15

Lisa Wimmer@WmLisa·13 Aug

@DHolzmueller @BlackHC The way I see it, there's only one true AU and that does not depend on the epistemic state of the learner (i.e., that it's a Dirac mixture). The *estimate* of AU does, and it will be 0 for x=0, but is that reliable in this situation of utter conflict?

David Holzmüller@DHolzmueller·6 Aug

@WmLisa @BlackHC To a certain degree yes, but for example here at x=0 don't you think there would be good reason to predict something close to a mixture of Diracs, and therefore assume low AU?

Andreas Kirsch 🇺🇦@BlackHC

Lisa Wimmer@WmLisa·1 Aug

If we understand your criticism correctly, it boils down to a concept of disagreement rather than ignorance and lack of knowledge, which is what epistemic uncertainty is actually supposed to capture. Please note that your suggestion of a second-order distribution ...

Lisa Wimmer@WmLisa·6 Aug

@DHolzmueller @BlackHC That illustrates the point quite well, I guess - the reported AU is 0, but should we trust this estimate? The answer is no, I think, because disagreement (or indeed, EU, according to Andreas :)) is high, so there's good reason to be wary

David Holzmüller@DHolzmueller·3 Aug

@BlackHC @WmLisa But in your "mixture of Diracs" example case, wouldn't aleatoric uncertainty also make sense (and be zero) even though epistemic uncertainty (according to your definition) is high?

Lisa Wimmer@WmLisa·6 Aug

@BlackHC @DHolzmueller That point kind of echoes our discussion in the paper: at least, we can only hope to get a reasonable estimate of AU when EU is low; otherwise, the estimate will not be trustworthy

Andreas Kirsch 🇺🇦@BlackHC·3 Aug

@WmLisa @DHolzmueller I don't accept that one btw (or say that it has to be correct). I actually think it's wrong in the sense that aleatoric uncertainty only makes sense for low epistemic uncertainty, in which case aleatoric uncertainty \approx total uncertainty

Lisa Wimmer@WmLisa·3 Aug

@DHolzmueller @BlackHC I guess that by definition, these 2 must be the same - if we're willing to accept that total = aleatoric + epistemic uncertainty, and the aleatoric component is non-reducible, right?
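
For reference, the additive decomposition this reply relies on can be written out as below. This is only a sketch, assuming the entropy-based measures named in the title of the paper under discussion (conditional entropy for AU, mutual information for EU). Q and \theta follow the notation of the quoted axioms; Y for the class label is a symbol introduced here.

```latex
\underbrace{H\big[\mathbb{E}_{\theta \sim Q}[\theta]\big]}_{\mathrm{TU}}
  \;=\;
\underbrace{\mathbb{E}_{\theta \sim Q}\big[H[\theta]\big]}_{\mathrm{AU}}
  \;+\;
\underbrace{I(Y;\theta)}_{\mathrm{EU}}
```

Under this identity, EU = 0 (for example when Q is a Dirac measure) gives TU = AU.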

David Holzmüller@DHolzmueller·3 Aug

@BlackHC @WmLisa So the question is whether you want to quantify "reducible total uncertainty" or "reducible second-order uncertainty" ?

Lisa Wimmer@WmLisa·3 Aug

@BlackHC I'm not saying disagreement shouldn't be part of the story, just not all of it :) but how does "ignorant about the truth" become manifest here? the learners don't *signal* ignorance, quite the opposite. plus imo: max reduction = max concentration potential, so back to uniform 🙃

Andreas Kirsch 🇺🇦@BlackHC·1 Aug

@WmLisa So ignorant about the truth but maximally opinionated, which also means that it makes the uncertainty for this point maximally reducible with more data, which is the definition of epistemic uncertainty as reducible uncertainty 🤓

Lisa Wimmer@WmLisa·1 Aug

It is also commonly taken as "noninformative" prior in Bayesian inference (where your model would cause an inconsistency as soon as the data contains observations from both classes). Happy to discuss further! 🙌

Lisa Wimmer@WmLisa·1 Aug

We agree that the uniform distribution can also be questioned as a model of complete ignorance, but as long as knowledge must be represented in terms of probability, it is commonly accepted as the best choice (justified, e.g., by the max entropy principle).

Lisa Wimmer@WmLisa·1 Aug

@BlackHC yeah, I suspected something was off - sorry, will do :)

Andreas Kirsch 🇺🇦@BlackHC·1 Aug

@WmLisa I think your replies got spread out of order 🥺 could you put all the replies one after another in a thread? 🙏 sorry for this but I don't know which order to read them in and reply to 🙈

Andreas Kirsch 🇺🇦@BlackHC·28 Jul

Reading "Quantifying Aleatoric and Epistemic Uncertainty in Machine Learning: Are Conditional Entropy and Mutual Information Appropriate Measures?" and not as impressed as I had hoped to be. I think it is a good example of why descriptive/axiomatic approaches to uncertainty can be wrong or unhelpful.

The paper looks at second-order distributions: cases where our model provides a distribution over distributions. E.g. a Dirichlet distribution (below) is a distribution over the simplex, whose points can be seen as the probabilities of a categorical distribution (here for 3 classes). For example, we have a deep ensemble and each ensemble member predicts a categorical distribution. Then the set of all predictions is an empirical distribution of distributions, and we can define uncertainty measures on it. In particular:
- aleatoric uncertainty (AU): (predicted) irreducible observation noise,
- epistemic uncertainty (EU): (predicted) reducible noise - if we had more data, the model might give us better predictions,
- total uncertainty (TU): something that measures both of the above together.

Epistemic uncertainty can be connected to prediction disagreement between ensemble members. The more the ensemble members disagree on a given sample, the more we would learn if we had more data (as we can cull more of the hypothesis space, or redistribute more probability mass in our parameter distribution).

Let's assume we look at classification. The paper proposes the following properties (most of which make no sense to me) and then examines whether the common EU/AU/TU decomposition satisfies them:
A0 TU, AU, and EU are non-negative.
A1 EU vanishes for Dirac measures Q=\delta_\theta.
A2 EU and TU are maximal for Q being the uniform distribution on \Delta_K^{(2)}.
A3 If Q' is a mean-preserving spread of Q, then EU(Q') >= EU(Q) (weak version) or EU(Q') > EU(Q) (strict version); the same holds for TU.
A4 If Q' is a center-shift of Q, then AU(Q') >= AU(Q) (weak version) or AU(Q') > AU(Q) (strict version); the same holds for TU.
A5 If Q' is a spread-preserving location shift of Q, then EU(Q')=EU(Q).

A0 might be sensible, and if uncertainty took values in (-\infty, +\infty), we could always apply exp(.) to make it non-negative.

A1 makes sense: if all models output the same distribution, we won't have any epistemic uncertainty.

A2 is already wrong, however: predictions from different ensemble members should have the highest epistemic uncertainty when they disagree the most with each other. That is the case when every ensemble member predicts a different class with 100% confidence. In the image above, this would mean that our second-order distribution is concentrated in the corners. But this is not a uniform distribution in the space of second-order distributions. Otoh, if any ensemble member shared class predictions (conf > 0 for a class) with another member, their disagreement obviously wouldn't be maximal anymore. So A2 is wrong (imo).

But this also takes down A3: A3 says that if we "spread" out the distributions in the second-order space, EU has to increase (or at least not decrease). But this is not true: if our second-order distribution is concentrated in the corners (EU is maximal), spreading it out can only decrease EU. Hence, a contradiction to A3.

A5 is similarly flawed then. It says that if we move our concentration around (but preserve "everything else"), EU has to stay constant. However, let's imagine we have 2 ensemble members in a 3-class case, with predictions (0, 1/2, 1/2) and (1/3, 1/3, 1/3), resp. (in the Dirichlet plot this would be the middle of the x3 edge and the center). Now let's imagine we shift everything up: to (1/3, 1/3, 1/3) and (1, 0, 0) (in the plot, center and upper corner). Again, the disagreement (and EU) is stronger in the latter configuration than in the former, so A5 does not make sense either.

Obviously, one could equally argue that my takes on disagreement do not make sense. However, I believe that epistemic uncertainty via disagreement is empirically grounded in the expected information gain (via lots of evidence). (There is more to be said about which information gain exactly we want to look at, but that is for a different post.) But maybe the more acceptable challenge is to ask why the proposed properties in this paper should be considered natural.
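
The disagreement argument above can be checked numerically. The following is a minimal sketch, assuming the entropy-based decomposition named in the paper's title (TU = entropy of the averaged prediction, AU = average entropy of the members, EU = TU - AU, i.e. the mutual information). The helper names `entropy` and `decompose`, the sample size, and the seed are illustrative choices, not taken from the paper or the thread.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    # Replace zeros by 1 inside the log so they contribute exactly 0.
    log_p = np.log(np.where(p > 0, p, 1.0))
    return -np.sum(p * log_p, axis=axis)


def decompose(members):
    """Return (TU, AU, EU) for an (M, K) array of categorical predictions."""
    members = np.asarray(members, dtype=float)
    tu = entropy(members.mean(axis=0))  # entropy of the averaged prediction
    au = entropy(members).mean()        # average entropy of the members
    return tu, au, tu - au              # EU is the mutual information


# Three members, each predicting a different class with full confidence.
_, _, eu_corners = decompose(np.eye(3))

# Many members drawn uniformly from the simplex (Dirichlet(1, 1, 1)).
rng = np.random.default_rng(0)
uniform_members = rng.dirichlet([1.0, 1.0, 1.0], size=10_000)
_, _, eu_uniform = decompose(uniform_members)

# The two-member configurations from the A5 example.
_, _, eu_before = decompose([[0.0, 0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]])
_, _, eu_after = decompose([[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0]])

print("EU, members in the corners:", eu_corners)
print("EU, uniform second-order samples:", eu_uniform)
print("EU, A5 example before shift:", eu_before)
print("EU, A5 example after shift:", eu_after)
```

Under this particular measure, the corner configuration reaches log(3), the largest value mutual information can take for 3 classes, while the uniform second-order distribution stays well below it. Whether mutual information is the right measure of EU is exactly what the thread disputes, so this illustrates the disagreement view rather than settling the argument.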

Lisa Wimmer retweeted

ECML PKDD@ECMLPKDD·30 Aug

📢The 2023 Best Paper Award for the Research Track goes to "Towards Efficient MCMC Sampling in Bayesian Neural Networks by Exploiting Symmetry" by Jonas Gregor Wiese, Lisa Wimmer, Theodore Papamarkou, Bernd Bischl, Stephan Günnemann, David Rügamer 🥳 congrats to the authors!

Michael Kirchhof@mkirchhof_

Lisa Wimmer@WmLisa·30 May

Great insights about multi-source uncertainty based on a recent preprint by my wonderful colleague @CoGruber 🔥 read for a truly holistic story, made in Statistics, that starts with the data arxiv.org/pdf/2305.16703

If I say that there is irreducible (aleatoric) uncertainty in every ML task, that doesn't mean I don't believe in determinism of the physical world. Here's why: 🧵1/9
