Answer by Yoshua Bengio:
There are many ways that a neural network can represent a conditional probability distribution. In the experiments reported in the paper, it is done with classical sigmoid output units, each of which represents the probability that an output variable (here the i-th bit X_i to be reconstructed) takes the value 1. In that case, the random bits X_i are assumed to be conditionally independent given X_tilde. You can choose other kinds of distributions (any parametric distribution, by making its parameters a function of the neural network outputs); here the distribution is a factorized Bernoulli, each bit with probability p_i = sigmoid(a_i), where a_i are the pre-sigmoid outputs of the neural net.
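A minimal NumPy sketch of this factorized Bernoulli output distribution (the specific values of a and x below are illustrative, not from the paper): each pre-sigmoid output a_i is mapped to p_i = sigmoid(a_i), and the log-likelihood of a binary vector x factorizes as a sum over bits.

```python
import numpy as np

def sigmoid(a):
    # Map pre-sigmoid outputs a_i to Bernoulli probabilities p_i in (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(x, a):
    # Log-probability of the binary vector x under the factorized Bernoulli
    # distribution with p_i = sigmoid(a_i): sum_i log P(X_i = x_i | X_tilde).
    # This is exactly the negative cross-entropy between x and p.
    p = sigmoid(a)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Tiny illustrative example with two output bits:
a = np.array([2.0, -1.0])   # pre-sigmoid outputs of the neural net
x = np.array([1.0, 0.0])    # observed bits X_i to reconstruct
ll = log_likelihood(x, a)   # ≈ -0.4402
```

Because the bits are assumed conditionally independent given X_tilde, the joint log-probability is just this sum of per-bit Bernoulli log-probabilities.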
The neural net is trained as usual, by back-propagating the gradient of the log-likelihood of the outputs (which is equivalent to the cross-entropy, in the above example). The only difference with ordinary neural nets (but similarly to dropout) is that noise is injected into the neural net (in the inputs, and possibly in the hidden units as well).
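The training procedure can be sketched as a single-hidden-layer denoising autoencoder trained by SGD; the layer sizes, corruption probability, and learning rate below are illustrative choices, not values from the paper. Noise is injected by randomly zeroing input bits (producing X_tilde), and the net back-propagates the cross-entropy between its sigmoid outputs and the clean x.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(params, x, rng, lr=0.1, corrupt_p=0.3):
    # One SGD step on the denoising reconstruction log-likelihood.
    W, b, V, c = params["W"], params["b"], params["V"], params["c"]
    # Inject noise: drop each input bit with probability corrupt_p,
    # giving the corrupted input X_tilde (noise could also be injected
    # into the hidden units).
    x_tilde = x * (rng.random(x.shape) >= corrupt_p)
    h = sigmoid(W @ x_tilde + b)   # hidden units
    a = V @ h + c                  # pre-sigmoid outputs a_i
    p = sigmoid(a)                 # p_i = P(X_i = 1 | X_tilde)
    # Cross-entropy = negative log-likelihood of the *clean* x.
    loss = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    # Backprop: for sigmoid units with cross-entropy, d loss / d a = p - x.
    da = p - x
    dh = (V.T @ da) * h * (1 - h)
    params["V"] = V - lr * np.outer(da, h)
    params["c"] = c - lr * da
    params["W"] = W - lr * np.outer(dh, x_tilde)
    params["b"] = b - lr * dh
    return loss

rng = np.random.default_rng(0)
n_vis, n_hid = 8, 4  # illustrative sizes
params = {
    "W": rng.normal(0, 0.1, (n_hid, n_vis)),
    "b": np.zeros(n_hid),
    "V": rng.normal(0, 0.1, (n_vis, n_hid)),
    "c": np.zeros(n_vis),
}
x = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
losses = [train_step(params, x, rng) for _ in range(500)]
```

Apart from the noise injection each step, this is an entirely ordinary back-propagation training loop; the reconstruction loss fluctuates because of the random corruption but decreases on average.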