I originally wanted to write down the proof for the Gumbel-max trick but soon realized it is actually the same idea as a much more common problem: exponential race. So, in this note let’s go from this common problem and arrive at the Gumbel-max trick.
As a preparation let’s solve a probability problem first.
There are $n$ clocks started simultaneously, such that the $i$-th clock rings after a random time $T_i \sim \text{Exp}(\lambda_i)$.
(1) Designate $T = \min_i T_i$ as the random time after which some clock (i.e. any one of the $n$ clocks) rings; find the distribution of $T$.
(2) Find the probability that the $i$-th clock rings first.
Let $F_1, \dots, F_n$ and $F$ be the CDFs of $T_1, \dots, T_n$ and $T$. We also have $F_i(t) = 1 - e^{-\lambda_i t}$.
Following order statistics of $T = \min_i T_i$, we have

$$P(T > t) = \prod_{i=1}^n P(T_i > t) = \prod_{i=1}^n e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^n \lambda_i\right) t},$$

or equivalently, $T \sim \text{Exp}\left(\sum_{i=1}^n \lambda_i\right)$,
i.e. the $\min$ of a set of independent exponential random variables is still an exponential random variable, with the rate being the sum of the rates of that set of random variables.
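This minimum-of-exponentials fact is easy to check numerically. A small NumPy sketch (my addition; the rates $1, 2, 3$ are just an illustrative choice):

```python
import numpy as np

# Sanity check: the min of independent exponentials with rates 1, 2, 3
# should itself be Exp(6), whose mean is 1/6.
rng = np.random.default_rng(0)
rates = np.array([1.0, 2.0, 3.0])
# NumPy parameterizes the exponential by scale = 1 / rate.
samples = rng.exponential(scale=1.0 / rates, size=(200_000, 3))
mins = samples.min(axis=1)
print(mins.mean())  # close to 1/6 ~ 0.1667
```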
For the second part of the problem, we can consider two competing clocks $T_1$ and $T_2$ to begin with. Our goal is to find $P(T_1 < T_2)$:

$$P(T_1 < T_2) = \int_0^\infty P(T_2 > t)\, \lambda_1 e^{-\lambda_1 t}\, dt = \int_0^\infty \lambda_1 e^{-(\lambda_1 + \lambda_2) t}\, dt = \frac{\lambda_1}{\lambda_1 + \lambda_2}.$$
Now, let’s consider one specific clock $T_i$ versus all the rest, denoted $T_{-i} = \min_{j \neq i} T_j$. According to part (1), we know that $T_{-i} \sim \text{Exp}\left(\sum_{j \neq i} \lambda_j\right)$. Using the result above we have the solution for part (2) as follows

$$P\left(T_i < T_{-i}\right) = \frac{\lambda_i}{\lambda_i + \sum_{j \neq i} \lambda_j} = \frac{\lambda_i}{\sum_{j=1}^n \lambda_j}.$$
Of course, we can do the integration directly and get the same result:

$$P\left(T_i = \min_j T_j\right) = \int_0^\infty \left(\prod_{j \neq i} e^{-\lambda_j t}\right) \lambda_i e^{-\lambda_i t}\, dt = \frac{\lambda_i}{\sum_{j=1}^n \lambda_j}.$$
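We can also confirm part (2) by simulation. A quick NumPy check (my addition; again with illustrative rates $1, 2, 3$, so the first-ring probabilities should be $1/6, 2/6, 3/6$):

```python
import numpy as np

# Each row is one race of three clocks; argmin picks the clock that rings first.
rng = np.random.default_rng(1)
rates = np.array([1.0, 2.0, 3.0])
samples = rng.exponential(scale=1.0 / rates, size=(200_000, 3))
winners = samples.argmin(axis=1)
freqs = np.bincount(winners, minlength=3) / len(winners)
print(freqs)  # close to [0.1667, 0.3333, 0.5]
```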
By the way, this setup, where we have multiple exponential random variables and look for the first arrival, is also called an exponential race.
I just made up the name “Exponential-Min”. A better name for this section is probably Sampling from Multinomial by Argmining.
Suppose we have a set of positive numbers $x_1, \dots, x_n$. Correspondingly we have a normalized probability vector $p = (p_1, \dots, p_n)$, where $p_i = x_i / \sum_{j=1}^n x_j$. This probability vector specifies a multinomial distribution over $n$ choices.
Now, if we were to get a sample according to this multinomial distribution specified by $p$ (which is fundamentally specified by $x_1, \dots, x_n$), what should we do?
Normally, we do the following:
- We have $x_1, \dots, x_n$
- We compute the cumulative sums $c_k = \sum_{i=1}^{k} p_i$ for $k = 1, \dots, n$, where $p_i = x_i / \sum_{j=1}^n x_j$.
- We generate a uniform random number between 0 and 1, i.e. $u \sim \text{Uniform}(0, 1)$
- We figure out where $u$ lands, i.e. if $c_{k-1} < u \le c_k$ we return $k$. (Of course we should use $c_0 = 0$ and $c_n = 1$)
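The standard procedure above can be sketched in a few lines of NumPy (my addition; the function name `sample_by_inverse_cdf` is mine, not standard terminology):

```python
import numpy as np

def sample_by_inverse_cdf(x, rng):
    """Standard categorical sampling: normalize, take cumulative sums,
    and return the first index k whose cumulative mass c_k >= u."""
    p = np.asarray(x, dtype=float)
    p = p / p.sum()
    cum = np.cumsum(p)                # c_1, ..., c_n, with c_n = 1
    u = rng.uniform()                 # u ~ Uniform(0, 1)
    k = int(np.searchsorted(cum, u))  # first k (0-indexed) with u <= c_k
    return min(k, len(p) - 1)         # guard against float round-off near u = 1

rng = np.random.default_rng(2)
draws = [sample_by_inverse_cdf([1, 2, 3], rng) for _ in range(60_000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)  # close to [1/6, 2/6, 3/6]
```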
But that’s the boring way. Now that we have this new Exponential-Min trick, we can do the following:
- We have $x_1, \dots, x_n$
- We don’t compute $p$; instead we sample $t_i \sim \text{Exp}(x_i)$ for $i = 1, \dots, n$, i.e. we have a total of $n$ samples, one from each $\text{Exp}(x_i)$
- We now take $k = \arg\min_i t_i$ as our result sample
- We proved in part (2) above that such a result sample indeed follows the multinomial distribution specified by $p$, where $p_i = x_i / \sum_{j=1}^n x_j$
Thus, somehow we ended up Sampling from Multinomial by Argmining!
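A sketch of this argmin sampler (my addition; `sample_by_argmin` is a name I made up for illustration):

```python
import numpy as np

def sample_by_argmin(x, rng):
    """Exponential-Min trick: draw t_i ~ Exp(rate = x_i) and return the argmin.
    No normalization or cumulative sums needed."""
    x = np.asarray(x, dtype=float)
    t = rng.exponential(scale=1.0 / x)  # NumPy takes scale = 1 / rate
    return int(np.argmin(t))

rng = np.random.default_rng(3)
draws = [sample_by_argmin([1, 2, 3], rng) for _ in range(60_000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)  # close to [1/6, 2/6, 3/6]
```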
Now let’s move on from the Exponential distribution to the Gumbel distribution.
The Gumbel distribution with unit scale ($\beta = 1$) is parameterized by a location parameter $\mu$. $\text{Gumbel}(\mu)$ has CDF and PDF as follows

$$F(x; \mu) = e^{-e^{-(x - \mu)}}, \qquad f(x; \mu) = e^{-(x - \mu)}\, e^{-e^{-(x - \mu)}}.$$
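Inverting this CDF gives a simple sampler: setting $e^{-e^{-(x-\mu)}} = u$ and solving for $x$ yields $x = \mu - \log(-\log u)$ with $u \sim \text{Uniform}(0, 1)$. A sketch (my addition; the mean check uses the known fact that $\text{Gumbel}(0)$ has mean equal to the Euler–Mascheroni constant $\gamma \approx 0.5772$):

```python
import numpy as np

def sample_gumbel(mu, rng, size=None):
    """Inverse-CDF sampling: solve e^{-e^{-(x - mu)}} = u for x,
    giving x = mu - log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.uniform(size=size)
    return mu - np.log(-np.log(u))

rng = np.random.default_rng(4)
g = sample_gumbel(0.0, rng, size=200_000)
print(g.mean())  # close to the Euler-Mascheroni constant, ~0.5772
```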
Consider a set of independent Gumbel random variables $G_1, \dots, G_n$, each with its own parameter $\mu_i$, i.e. $G_i \sim \text{Gumbel}(\mu_i)$.
The Gumbel distribution has two properties that are quite analogous to the exponential race example above.
- (1) Let $G = \max_i G_i$, then $G \sim \text{Gumbel}\left(\log \sum_{i=1}^n e^{\mu_i}\right)$

The proof is straightforward and similar to above:

$$P(G \le x) = \prod_{i=1}^n e^{-e^{-(x - \mu_i)}} = e^{-e^{-x} \sum_{i=1}^n e^{\mu_i}} = e^{-e^{-\left(x - \log \sum_{i=1}^n e^{\mu_i}\right)}}.$$

- (2) A corollary of the above is that the probability of $G_i$ being the max is

$$P\left(G_i = \max_j G_j\right) = \frac{e^{\mu_i}}{\sum_{j=1}^n e^{\mu_j}},$$

which is exactly the softmax of $(\mu_1, \dots, \mu_n)$.
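Property (2) is also easy to verify numerically. A NumPy sketch (my addition; the locations $(-1, 0, 2)$ are an arbitrary illustrative choice, and `Generator.gumbel` is NumPy’s built-in Gumbel sampler):

```python
import numpy as np

# Check property (2): with locations mu = (-1, 0, 2), the frequency with which
# G_i is the max should match softmax(mu).
rng = np.random.default_rng(5)
mu = np.array([-1.0, 0.0, 2.0])
g = rng.gumbel(loc=mu, size=(200_000, 3))  # row i: one draw from each Gumbel(mu_i)
freqs = np.bincount(g.argmax(axis=1), minlength=3) / len(g)
softmax = np.exp(mu) / np.exp(mu).sum()
print(freqs)    # empirical argmax frequencies
print(softmax)  # close to [0.042, 0.114, 0.844]
```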
Now here we can tell a nearly identical/parallel story as in the section Exponential-Min Trick. And this section should really be called Sampling from Multinomial by Argmaxing.
The main differences are
- The numbers (parameters) $\mu_i$ can be potentially negative, whereas $x_i$ must be positive
- The probability vector is determined by $p_i = e^{\mu_i} / \sum_{j=1}^n e^{\mu_j}$ (the softmax) instead of $p_i = x_i / \sum_{j=1}^n x_j$
- We generate samples with $g_i \sim \text{Gumbel}(\mu_i)$ instead of $t_i \sim \text{Exp}(x_i)$
- We take $\arg\max$ over $g_i$ instead of taking $\arg\min$ over $t_i$
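Putting the parallel story together, the whole trick fits in a few lines (my addition; `gumbel_max_sample` is an illustrative name, and I use the equivalent formulation of adding $\text{Gumbel}(0)$ noise to each $\mu_i$):

```python
import numpy as np

def gumbel_max_sample(mu, rng):
    """Gumbel-Max trick: add Gumbel(0) noise to each mu_i and take the argmax;
    the result is distributed as softmax(mu)."""
    mu = np.asarray(mu, dtype=float)
    noise = rng.gumbel(size=mu.shape)  # Gumbel(0) noise
    return int(np.argmax(mu + noise))

rng = np.random.default_rng(6)
mu = np.array([-1.0, 0.0, 2.0])
draws = [gumbel_max_sample(mu, rng) for _ in range(60_000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)  # close to softmax(mu) ~ [0.042, 0.114, 0.844]
```

Note that nothing here requires normalizing $e^{\mu_i}$, which is exactly why the trick is convenient when the $\mu_i$ are unnormalized log-probabilities (logits).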
When is Gumbel-Max Trick Useful?
It seems like a lot of work to sample from a multinomial by argmaxing over Gumbel samples (or argmining over Exponential samples). In what situation would we ever want to do this?
The short answer is that the Gumbel-Max trick allows us to make a sampling step differentiable. Specifically, it makes sampling from a multinomial distribution differentiable. We’ll take a closer look at this in a future post, but pause for a second and think about it: we are saying it is possible to differentiate through the action of drawing a discrete sample from a multinomial distribution! This was a pretty surprising/amazing possibility to me.
Regarding downstream applications, differentiating through sampling is an important “trick” in neural-network-based variational inference in general. Multinomial discrete random variables are prevalent in many learning problems. The Gumbel-Max trick allows us to work with them in many interesting neural variational inference problems, which we will look into in future posts.