What does it mean that classes are mutually exclusive but soft-labels are accepted?
As can be seen here, tf.nn.softmax simply applies the softmax function to an input tensor. The softmax "squishes" the inputs so that they sum to 1; it is a way of normalizing. The output of softmax has the same shape as the input - it only normalizes the values - and its entries can be interpreted as probabilities. In contrast, tf.nn.softmax_cross_entropy_with_logits computes the cross-entropy of the result after applying the softmax function (but it does it all together in a more mathematically careful way). That care is needed because $y_i$ in $\log(y_i)$ can be zero. As you can read here, a randomly initialized softmax layer is extremely unlikely to output an exact 0 in any class, but it is possible, so it is worth allowing for. First, do not evaluate $\log(y_i)$ for any $y'_i = 0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like $\log(\max(y_{\text{predict}}, 10^{-15}))$ for numerical stability - in many cases it is not required, but it is sensible defensive programming. I encourage you to take a look at the answers to this question.
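To make the difference concrete, here is a minimal sketch (assuming TensorFlow 2.x; the tensor values are made up for illustration) that computes the same cross-entropy twice: once by applying tf.nn.softmax and writing the log-loss by hand, with the defensive clipping mentioned above, and once with the fused tf.nn.softmax_cross_entropy_with_logits:

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.2]])
labels = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

# Route 1: tf.nn.softmax, then cross-entropy written by hand.
probs = tf.nn.softmax(logits)                 # same shape as logits, each row sums to 1
eps = 1e-15                                   # defensive clipping so log() never sees an exact 0
manual_ce = -tf.reduce_sum(labels * tf.math.log(tf.maximum(probs, eps)), axis=-1)

# Route 2: the fused, numerically careful op.
fused_ce = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

print(manual_ce.numpy(), fused_ce.numpy())    # the two agree up to floating-point error
```

The two routes give the same numbers here; the fused op is preferred because it avoids the separate softmax/log steps that can underflow.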
NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.
The first sentence means that your classifier may not be able to reproduce the labels exactly as they are, one-hot-encoded. What it does instead is estimate how likely it is that the input belongs to each class. This is not a problem even if the output vector is not strictly mutually exclusive in the way the algorithm implies; all that is needed is a vector whose entries sum to one. If they do not, the computation of the gradient will be incorrect. I guess this line was added to announce that the outputs of this differentiable component will not be one-hot-encoded, and that this is due to the nature of these nets: the first layers of convolutional networks act like basis vectors, each class shares these bases, and all inputs are composed of them.
I'm wondering whether, in a mutually exclusive multi-class case where the only constraint on the labels is that they must form a valid probability distribution, labels = [0.5 0.5] should be a valid instance label. This label would mean that neither the annotator nor the net can tell whether this ground-truth instance belongs to class_0 or class_1...
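As a small illustration of that point (assuming TensorFlow 2.x; the logits are arbitrary), both a one-hot row and a soft [0.5, 0.5] row are accepted, because each is a valid probability distribution:

```python
import tensorflow as tf

logits = tf.constant([[1.2, -0.3]])

hard_label = tf.constant([[1.0, 0.0]])   # one-hot: the annotator is certain it is class_0
soft_label = tf.constant([[0.5, 0.5]])   # also valid: total uncertainty between the two classes

# Both label rows are accepted; the op just computes the cross-entropy
# against whatever distribution you supply.
print(tf.nn.softmax_cross_entropy_with_logits(labels=hard_label, logits=logits).numpy())
print(tf.nn.softmax_cross_entropy_with_logits(labels=soft_label, logits=logits).numpy())

# A row like [0.5, 0.7] would not raise an error, but as the docs warn,
# the gradient would no longer correspond to a proper cross-entropy.
```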
Basically, in multi-label classification a single input may carry several labels, so the classes are no longer mutually exclusive. Moreover, in those cases we don't use softmax as the last layer: you have a sigmoid for each output, and the cross-entropy cost function is slightly different. The output of each entry should be a valid probability on its own, which is why sigmoid is used, and the label vectors for such tasks are no longer one-hot-encoded: different classes have different entries, and the entry for a class is one whenever an instance of that class is present.
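A hedged sketch of that setup (assuming TensorFlow 2.x and made-up values): a multi-hot label row, one sigmoid per class, and tf.nn.sigmoid_cross_entropy_with_logits as the per-class cost:

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])   # raw scores for 3 classes that may co-occur
labels = tf.constant([[1.0, 0.0, 1.0]])    # multi-hot: classes 0 and 2 are both present

# One independent binary cross-entropy per class, driven by a sigmoid rather than a softmax.
per_class_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_class_loss)

probs = tf.sigmoid(logits)                 # each entry is its own probability; the row need not sum to 1
print(per_class_loss.numpy(), probs.numpy())
```

Because each class gets its own sigmoid, the predicted probabilities do not (and should not) sum to one across classes, which is exactly what the multi-label setting requires.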