Lisa Wimmer

14 posts

@WmLisa

PhD candidate @sldslmu @MunichCenterML @LMU_Muenchen

Joined September 2020
77 Following, 79 Followers
Andreas Kirsch 🇺🇦
Andreas Kirsch 🇺🇦@BlackHC·
A small info-theory thread (or at least food for thought): Why is the Bayesian Model Average the best choice? Really, why? I'll go through a naive argument (does anyone have better references?), simple lower bounds and decompositions, and pitch a "reverse mutual information" 1/15
English
4
22
153
17.5K
Lisa Wimmer
Lisa Wimmer@WmLisa·
@DHolzmueller @BlackHC The way I see it, there's only one true AU and that does not depend on the epistemic state of the learner (i.e., that it's a Dirac mixture). The *estimate* of AU does, and it will be 0 for x=0, but is that reliable in this situation of utter conflict?
English
0
0
1
50
David Holzmüller
David Holzmüller@DHolzmueller·
@WmLisa @BlackHC To a certain degree yes, but for example here at x=0 don't you think there would be good reason to predict something close to a mixture of Diracs, and therefore assume low AU?
English
1
0
1
62
Lisa Wimmer
Lisa Wimmer@WmLisa·
If we understand your criticism correctly, it boils down to a concept of disagreement rather than ignorance and lack of knowledge, which is what epistemic uncertainty is actually supposed to capture. Please note that your suggestion of a second-order distribution ...
Andreas Kirsch 🇺🇦@BlackHC

Reading "Quantifying Aleatoric and Epistemic Uncertainty in Machine Learning: Are Conditional Entropy and Mutual Information Appropriate Measures?" and not as impressed as I had hoped to be. I think it is a good example of why descriptive/axiomatic approaches to uncertainty can be wrong/not helpful.

The paper looks at second-order distributions: when our model provides a distribution of distributions. E.g. a Dirichlet distribution (below) is a distribution over the simplex, whose points can be seen as the probabilities of a categorical distribution (here for 3 classes). For example, we have a deep ensemble and each ensemble member predicts a categorical distribution. Then the set of all predictions is an empirical distribution of distributions, and we can define uncertainty measures on it. In particular:
- aleatoric uncertainty: (predicted) irreducible observation noise,
- epistemic uncertainty: (predicted) reducible noise - if we had more data, the model might give us better predictions,
- total uncertainty: something that measures both of the above together.

Epistemic uncertainty can be connected to prediction disagreement between ensemble members. The more different ensemble members disagree on a given sample, the more we would learn if we had more data (as we can cull more of the hypothesis space, or redistribute more probability mass in our parameter distribution).

Let's assume we look at classification. The paper proposes the following properties (which partially or mostly make no sense to me) and then examines whether the common EU/AU/TU decomposition satisfies them:
A0 TU, AU, and EU are non-negative.
A1 EU vanishes for Dirac measures Q=\delta_\theta.
A2 EU and TU are maximal for Q being the uniform distribution on \Delta_K^{(2)}.
A3 If Q' is a mean-preserving spread of Q, then EU(Q') >= EU(Q) (weak version) or EU(Q') > EU(Q) (strict version); the same holds for TU.
A4 If Q' is a center-shift of Q, then AU(Q') >= AU(Q) (weak version) or AU(Q') > AU(Q) (strict version); the same holds for TU.
A5 If Q' is a spread-preserving location shift of Q, then EU(Q')=EU(Q).

A0 might be sensible; and if uncertainty ranged over (-\infty, +\infty), we could always use exp(.) to make it non-negative.

A1 makes sense: if all models output the same distribution, we won't have any epistemic uncertainty.

A2 is already wrong, however: predictions from different ensemble members should have the highest epistemic uncertainty when they disagree the most with each other. That is the case when every ensemble member predicts a different class with 100% confidence. In the image above, this would mean that our second-order distribution is concentrated in the corners. But this is not a uniform distribution in the space of second-order distributions. Otoh if any ensemble member had shared class predictions (conf > 0 for a class) with another member, obv, their disagreement wouldn't be maximal already. So A2 is wrong (imo).

But this also takes down A3: A3 says that if we "spread" out the distributions in the second-order space, EU has to increase (or at least not decrease). But this is not true: if our second-order distribution is concentrated in the corners (EU is maximal), spreading it out can only decrease EU. Hence, a contradiction to A3.

A5 is similarly flawed then. It says that if we move our concentration around (but preserve "everything else"), EU has to stay constant. However, let's imagine we have 2 ensemble members predicting in a 3-class case, with predictions (0, 1/2, 1/2) and (1/3, 1/3, 1/3), resp. (in the Dirichlet plot this would be the middle of the x3 edge and the center). Now let's imagine we shift everything up: to (1/3, 1/3, 1/3) and (1, 0, 0) (in the plot, center and upper corner). Again, the disagreement (and EU) is stronger in the latter configuration than in the former, so A5 does not make sense either.

Obviously, one could equally argue that my takes on disagreement do not make sense. However, I believe that epistemic uncertainty via disagreement is empirically grounded in the expected information gain (via lots of evidence). (There is more to be said about which information gain we want to look at exactly, but that is for a different post.) But maybe the more acceptable challenge is to ask why the proposed properties in this paper should be considered natural.

English
2
0
7
1.2K
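The disagreement argument in the quoted thread can be checked numerically. The following is a minimal sketch, assuming the entropy-based measures named in the paper title (total uncertainty as the entropy of the averaged prediction, aleatoric uncertainty as the average member entropy, epistemic uncertainty as their difference, i.e. mutual information) and equally weighted ensemble members. The function names `entropy` and `decompose` are illustrative and do not come from the thread or the paper.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a categorical distribution (0*log 0 treated as 0)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose(members):
    """Return (TU, AU, EU) for equally weighted ensemble members.

    TU = entropy of the mean prediction,
    AU = mean of the member entropies,
    EU = TU - AU (mutual information between label and member index).
    """
    n = len(members)
    k = len(members[0])
    mean = [sum(m[c] for m in members) / n for c in range(k)]
    tu = entropy(mean)
    au = sum(entropy(m) for m in members) / n
    return tu, au, tu - au


# Every member predicts a different class with full confidence ("corners").
corners = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# The two configurations from the A5 example in the quoted thread.
before = [(0.0, 0.5, 0.5), (1 / 3, 1 / 3, 1 / 3)]
after = [(1 / 3, 1 / 3, 1 / 3), (1.0, 0.0, 0.0)]

for name, members in [("corners", corners), ("before", before), ("after", after)]:
    tu, au, eu = decompose(members)
    print(f"{name}: TU={tu:.3f} AU={au:.3f} EU={eu:.3f}")
```

Under these assumptions, the corner configuration has an aleatoric estimate of zero and an epistemic term of log 3, and the "after" configuration has a larger epistemic term than the "before" one, which matches the claim about the two example configurations. A different choice of measure could of course behave differently.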
Lisa Wimmer
Lisa Wimmer@WmLisa·
@DHolzmueller @BlackHC That illustrates the point quite well, I guess - the reported AU is 0, but should we trust this estimate? The answer is no, I think, because disagreement (or indeed, EU, according to Andreas :)) is high, so there's good reason to be wary
English
1
0
1
73
David Holzmüller
David Holzmüller@DHolzmueller·
@BlackHC @WmLisa But in your "mixture of Diracs" example case, wouldn't aleatoric uncertainty also make sense (and be zero) even though epistemic uncertainty (according to your definition) is high?
English
2
0
2
75
Lisa Wimmer
Lisa Wimmer@WmLisa·
@BlackHC @DHolzmueller That point kind of echoes our discussion in the paper: at least, we can only hope to get a reasonable estimate of AU when EU is low; otherwise, the estimate will not be trustworthy
English
0
0
0
46
Andreas Kirsch 🇺🇦
Andreas Kirsch 🇺🇦@BlackHC·
@WmLisa @DHolzmueller I don't accept that one btw (or say that it has to be correct). I actually think it's wrong in the sense that aleatoric uncertainty only makes sense for low epistemic uncertainty, in which case aleatoric uncertainty \approx total uncertainty
English
2
0
0
89
Lisa Wimmer
Lisa Wimmer@WmLisa·
@DHolzmueller @BlackHC I guess that by definition, these 2 must be the same - if we're willing to accept that total = aleatoric + epistemic uncertainty, and the aleatoric component is non-reducible, right?
English
2
0
2
92
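For reference, the additive decomposition mentioned in this reply is commonly written in entropy form. This is a sketch assuming the conditional-entropy and mutual-information measures named in the paper title; the notation (Q for the second-order distribution, \theta for a first-order categorical distribution, Y for the label) is an assumption and is not taken from the thread.

```latex
\begin{align*}
\underbrace{H\big(\mathbb{E}_{\theta \sim Q}[\theta]\big)}_{\mathrm{TU}}
  \;=\;
\underbrace{\mathbb{E}_{\theta \sim Q}\big[H(\theta)\big]}_{\mathrm{AU}}
  \;+\;
\underbrace{I(Y;\theta)}_{\mathrm{EU}}
\end{align*}
```

Read this way, if the AU term is treated as irreducible, then whatever part of TU can be reduced coincides with the EU term.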
David Holzmüller
David Holzmüller@DHolzmueller·
@BlackHC @WmLisa So the question is whether you want to quantify "reducible total uncertainty" or "reducible second-order uncertainty" ?
English
1
0
0
82
Lisa Wimmer
Lisa Wimmer@WmLisa·
@BlackHC I'm not saying disagreement shouldn't be part of the story, just not all of it :) but how does "ignorant about the truth" become manifest here? the learners don't *signal* ignorance, quite the opposite. plus imo: max reduction = max concentration potential, so back to uniform 🙃
English
1
0
1
68
Andreas Kirsch 🇺🇦
Andreas Kirsch 🇺🇦@BlackHC·
@WmLisa So ignorant about the truth but maximally opinionated, which also means that it makes the uncertainty for this point maximally reducible with more data, which is the definition of epistemic uncertainty as reducible uncertainty 🤓
English
1
0
0
76
Lisa Wimmer
Lisa Wimmer@WmLisa·
It is also commonly taken as a "noninformative" prior in Bayesian inference (where your model would cause an inconsistency as soon as the data contains observations from both classes). Happy to discuss further! 🙌
English
0
0
1
205
Lisa Wimmer
Lisa Wimmer@WmLisa·
We agree that the uniform distribution can also be questioned as a model of complete ignorance, but as long as knowledge must be represented in terms of probability, it is commonly accepted as the best choice (justified, e.g., by the max entropy principle).
English
1
0
1
232
Lisa Wimmer
Lisa Wimmer@WmLisa·
@BlackHC yeah, I suspected something was off - sorry, will do :)
English
0
0
0
91
Andreas Kirsch 🇺🇦
Andreas Kirsch 🇺🇦@BlackHC·
@WmLisa I think your replies got spread out of order 🥺 could you put all the replies one after another in a thread? 🙏 sorry for this but I don't know which order to read them in and reply to 🙈
English
1
0
0
140
Lisa Wimmer retweeted
ECML PKDD
ECML PKDD@ECMLPKDD·
📢The 2023 Best Paper Award for the Research Track goes to "Towards Efficient MCMC Sampling in Bayesian Neural Networks by Exploiting Symmetry" by Jonas Gregor Wiese, Lisa Wimmer, Theodore Papamarkou, Bernd Bischl, Stephan Günnemann, David Rügamer 🥳 congrats to the authors!
English
0
8
13
1.9K