Abstract
Deep learning is used in many applications, such as image recognition, speech recognition, and machine
translation. In many of these applications, softmax is used as the output activation function for modeling
categorical probability distributions. To represent a wide range of probabilities, a model should be able to produce
a correspondingly wide range of output patterns, i.e., it should have sufficient representational capacity. However,
softmax can limit representational capacity (the softmax bottleneck) under a certain condition. To break the
softmax bottleneck, we propose a novel output activation function, sigsoftmax, which is composed of sigmoid and
exponential functions. Sigsoftmax can produce a wider variety of output patterns than softmax without additional
parameters or additional computational cost. As a result, a model with sigsoftmax can be more accurate than
one with softmax.
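For concreteness, one natural form consistent with this description is a minimal sketch under our assumption (the abstract itself does not state the formula): the elementwise product of the exponential and the logistic sigmoid, normalized over the output vector z,

    sigsoftmax(z)_i = exp(z_i) σ(z_i) / Σ_j exp(z_j) σ(z_j),    where σ(z) = 1 / (1 + exp(−z)).

Since exp(z_i) σ(z_i) is nonnegative and the denominator normalizes the outputs to sum to one, this form still yields a valid categorical distribution while combining the two component functions named above.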
Sekitoshi Kanai, NTT Software Innovation Center
Email: