Nagesh Singh Chauhan
An Overview of Activation Functions in Deep Learning
This article takes a deep dive into the various activation functions used in deep learning.
Deep neural networks have proven beneficial for a variety of tasks, in particular by allowing end-to-end learning and reducing the need for manual design decisions. However, many parameters still have to be chosen in advance, which raises the need to optimize them. One important but often ignored design choice is the selection of a proper activation function. This article therefore focuses on the different activation functions used in deep learning.
What are Activation functions?
Activation functions are functions applied in neural networks to the weighted sum of a neuron's inputs and bias; the result decides whether, and how strongly, the neuron fires. The activation function transforms the data passing through the network, and its derivative is used during gradient-based training (usually gradient descent) to update the network's parameters. Activation functions (AFs) are also referred to as transfer functions in some literature.
Activation functions can be either linear or non-linear, depending on the function they represent, and are used to control the outputs of neural networks across many domains: object recognition and classification, speech recognition, segmentation, scene understanding and description, machine translation, text-to-speech systems, cancer detection, fingerprint detection, weather forecasting, and self-driving cars, to mention a few. Early research results validate that a proper choice of activation function improves results in neural network computing.
Various Activation Functions
Sigmoid Function
The Sigmoid AF is sometimes referred to as the logistic function or squashing function. Research on the sigmoid has produced three variants that are used in DL applications, described below. The Sigmoid is a non-linear AF used mostly in feedforward neural networks. It is a bounded, differentiable real function, defined for all real input values, with a positive derivative everywhere and some degree of smoothness. The Sigmoid function is given by the relationship:
f(x) = 1 / (1 + e^(-x))
The sigmoid function appears in the output layers of DL architectures, where it is used for predicting probability-based outputs. It has been applied successfully in binary classification problems, logistic regression tasks, and other neural network domains. Neal highlights the main advantages of sigmoid functions as being easy to understand and being used mostly in shallow networks.
Glorot and Bengio (2010), on the other hand, suggest that the Sigmoid AF should be avoided when initializing the neural network from small random weights.
Sigmoid/Logistic graph.
However, the Sigmoid AF suffers from major drawbacks, including sharply damped gradients during backpropagation from deeper hidden layers to the input layers, gradient saturation, slow convergence, and non-zero-centered output, which causes the gradient updates to propagate in different directions. Other AFs, including the hyperbolic tangent function, were proposed to remedy some of these drawbacks.
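The saturation and gradient damping described above can be made concrete with a small sketch. It assumes the standard logistic form 1 / (1 + e^(-x)); the function names are illustrative, not taken from any library, and only the Python standard library is used.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so z is at most 1
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative s(x) * (1 - s(x)); it peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def max_gradient_through(n_layers: int) -> float:
    """Upper bound on the factor contributed by n stacked sigmoids alone."""
    return 0.25 ** n_layers
```

Because the derivative never exceeds 0.25, the activations alone scale a backpropagated gradient by at most 0.25^n across n sigmoid layers, which is one way to see why deep sigmoid stacks converge slowly.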
1) Hard Sigmoid Function:
The hard sigmoid is another variant of the sigmoid activation function. In the form described here, it is given by:
f(x) = max(0, min(1, 0.2x + 0.5))
This function is a piecewise-linear approximation of the sigmoid function. It is equal to 0 on the range (-∞, -2.5), increases linearly from 0 to 1 on the range [-2.5, +2.5], and stays equal to 1 on the range (+2.5, +∞). Computing the hard sigmoid is faster than computing the regular sigmoid because no exponential has to be evaluated, and it provides reasonable results on classification tasks. Because it is only an approximation, however, it is a poor fit for regression tasks, where the error can be much higher than with the regular sigmoid.
Sigmoid vs. Hard Sigmoid AF.
A comparison of the hard sigmoid with the soft sigmoid shows that the hard sigmoid offers a lower computational cost when implemented in either specialized hardware or software, and the authors highlighted that it showed promising results on DL-based binary classification tasks.
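As a rough illustration of the trade-off, the sketch below implements the piecewise-linear form with breakpoints at -2.5 and +2.5 (slope 0.2, offset 0.5) and measures how far it strays from the regular sigmoid. The helper `max_abs_gap` and its grid are assumptions made for this example only.

```python
import math


def hard_sigmoid(x: float) -> float:
    """0 below -2.5, 1 above +2.5, and the line 0.2*x + 0.5 in between."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))


def sigmoid(x: float) -> float:
    """Regular logistic sigmoid (fine for moderate |x|)."""
    return 1.0 / (1.0 + math.exp(-x))


def max_abs_gap(lo: float = -6.0, hi: float = 6.0, steps: int = 1200) -> float:
    """Largest |hard_sigmoid - sigmoid| on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(abs(hard_sigmoid(x) - sigmoid(x)) for x in xs)
```

On this grid the two functions differ most near the breakpoints, which gives a feel for the approximation error that matters in regression-style outputs.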
2) Sigmoid-Weighted Linear Units (SiLU):
The Sigmoid-Weighted Linear Unit, or Sigmoid Linear Unit (SiLU), was proposed as an activation for neural-network function approximation in reinforcement learning. Its activation is computed as the sigmoid function multiplied by its input.
This function is given by:
f(x) = x · sigmoid(x)
The SiLU was proposed for the hidden layers of deep neural networks in reinforcement learning-based systems. The same function is also known as Swish, discussed later in this article.
3) Derivative of Sigmoid-Weighted Linear Units (dSiLU):
The derivative of the Sigmoid-Weighted Linear Unit is the gradient of the SiLU function and is referred to as dSiLU. Besides its role in gradient-descent updates of the network weights, the dSiLU was also proposed as an activation function in its own right. It is given by:
f(x) = sigmoid(x) · (1 + x · (1 - sigmoid(x)))
The dSiLU function response looks like an overshooting Sigmoid function as shown in the Figure below. The authors highlighted that the dSiLU outperformed the standard Sigmoid function significantly.
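A short sketch of both functions, assuming the forms x · sigmoid(x) for the SiLU and sigmoid(x) · (1 + x · (1 - sigmoid(x))) for the dSiLU. The function names are illustrative and only the standard library is used.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid (fine for moderate |x|)."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    """SiLU: the input weighted by its own sigmoid."""
    return x * sigmoid(x)


def dsilu(x: float) -> float:
    """Derivative of the SiLU with respect to its input."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))
```

Evaluating `dsilu` around x = 2.4 gives values slightly above 1, and around x = -2.4 values slightly below 0, which is the "overshooting sigmoid" shape mentioned above.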
Hyperbolic Tangent Function (Tanh)
Tanh is a smooth, zero-centered function whose range lies between -1 and 1. The output of the tanh function is given by:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
It was quite popular before the advent of more sophisticated activation functions.
Briefly, the benefits of using tanh instead of Sigmoid are:
Stronger gradients: if the data is centered around 0, the derivatives are higher.
Less bias in the gradients, because the output range includes negative values (-1, 0).
However, similar to Sigmoid, tanh is also susceptible to the vanishing gradient problem.
A property of the tanh function is that its gradient reaches its maximum of 1 only when the input x is 0, and shrinks toward zero as |x| grows. This can leave some neurons effectively dead during computation. A dead neuron is one whose weights are rarely updated because its gradient is close to zero. This limitation of the tanh function spurred further research on activation functions, which led to the rectified linear unit (ReLU) activation function.
The tanh function has been used mostly in recurrent neural networks for natural language processing and speech recognition tasks.
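The "stronger gradients" point and the saturation problem can both be checked numerically. This is a minimal sketch using the identity d/dx tanh(x) = 1 - tanh(x)^2; the helper names are illustrative.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, equal to 1 only at x = 0."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid, for comparison."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)
```

Near zero the tanh gradient is roughly four times the sigmoid gradient, while for inputs of magnitude 5 or more it is already below 0.001, which is the saturation behind the vanishing gradient problem.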
1) Hard Hyperbolic Tangent Function (Hardtanh):
The Hardtanh is a cheaper and more computationally efficient version of tanh. Its output lies within the range of -1 to 1, and it is given by:
f(x) = -1 if x < -1, f(x) = x if -1 ≤ x ≤ 1, and f(x) = 1 if x > 1
The hardtanh function has been applied successfully in natural language processing, with the authors reporting that it provided both speed and accuracy improvements.
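For reference, a one-line sketch of the Hardtanh, assuming the usual clipping thresholds of -1 and 1 (the function name is illustrative):

```python
def hardtanh(x: float) -> float:
    """Clip the input to [-1, 1]; no exponentials are needed."""
    return max(-1.0, min(1.0, x))
```

The speed advantage comes from replacing the exponentials in tanh with two comparisons.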
Softmax Function
The Softmax function is used to compute a probability distribution from a vector of real numbers. It produces outputs in the range between 0 and 1, with the sum of the probabilities equal to 1. The Softmax function is computed using the relationship:
f(x_i) = e^(x_i) / Σ_j e^(x_j)
The Softmax function is used in multi-class models, where it returns the probability of each class, with the target class expected to have the highest probability. It appears in the output layers of most deep learning architectures used for classification.
The softmax function squashes the output of each unit to be between 0 and 1, just like a sigmoid. It also divides each output such that the total sum of the outputs is equal to 1.
Graphical representation of the Softmax function.
The main difference between the Sigmoid and Softmax AFs is that the Sigmoid is used in binary classification, while the Softmax is used for multi-class classification tasks.
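A minimal sketch of the softmax, assuming the usual definition e^(x_i) / Σ_j e^(x_j). Subtracting the maximum logit first is a common numerical-stability trick that leaves the result unchanged; the function name is illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn a vector of real scores into probabilities that sum to 1."""
    m = max(logits)                           # shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With two classes and the second logit fixed at 0, the first softmax output equals sigmoid(x), which is why the sigmoid suffices for binary classification.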
Softsign Function
The Softsign is another non-linear AF used in DL applications. It works as a continuous approximation of the sign function, and its range is also (-1, +1).
The Softsign function is a rational function of the input, given by:
f(x) = x / (1 + |x|)
where |x| is the absolute value of the input.
The main difference between the Softsign function and the tanh function is that the Softsign approaches its asymptotes polynomially, unlike the tanh function, which approaches them exponentially.
The Softsign has been used mostly in regression computation problems but has also been applied to Deep Learning-based text to speech systems, with the authors reporting some promising results using the Softsign function.
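The difference in how quickly the two functions saturate can be seen with a short sketch, assuming the form x / (1 + |x|) for the Softsign (the function name is illustrative):

```python
import math


def softsign(x: float) -> float:
    """Softsign: x / (1 + |x|), bounded in (-1, 1)."""
    return x / (1.0 + abs(x))


def gap_to_one(f, x: float) -> float:
    """Distance between f(x) and the upper asymptote 1."""
    return 1.0 - f(x)
```

At x = 10 the Softsign is still about 0.09 away from 1, while tanh is within a few billionths, so the Softsign saturates far more gently.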
Rectified Linear Unit (ReLU) Function
ReLU is a very simple yet powerful activation function that outputs the input if the input is positive, and 0 otherwise. It is widely regarded as the most popular activation function for training neural networks, and it often yields better results than Sigmoid and tanh. Because its gradient is 1 for every positive input, it is far less susceptible to the vanishing gradient problem, but it may suffer from the “dying ReLU” problem.
As stated in Wikipedia: “ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state and “dies.” In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high.” Here’s a very good description of this issue: https://www.quora.com/What-is-the-dying-ReLU-problem-in-neural-networks.
The ReLU activation function performs a threshold operation on each input element, setting values less than zero to zero. The ReLU is thus given by:
f(x) = max(0, x)
Some implementations of this activation expose a parameter alpha, which controls the slope of the line for x < 0 and defaults to 0.0. Setting this parameter to a value between 0.0 and 1.0 turns the activation into a Leaky ReLU, and setting it to 1.0 makes it behave as a linear activation. What happens when alpha is greater than 1.0 would be interesting to investigate.
The main advantage of using rectified linear units is that they are fast to compute, since no exponentials or divisions are required. Another property of the ReLU is that it introduces sparsity in the hidden units, as it squashes values into the range from zero to the maximum. However, the ReLU has a limitation in that it overfits more easily than the sigmoid function, although the dropout technique has been adopted to reduce this overfitting, and rectified networks have improved the performance of deep neural networks.
The downside of being zero for all negative values is a problem called “dying ReLU.”
A ReLU neuron is “dead” if it’s stuck on the negative side and always outputs 0. Because the slope of ReLU in the negative range is also 0, once a neuron gets negative, it’s unlikely for it to recover. Such neurons are not playing any role in discriminating the input and are essentially useless. Over time you may end up with a large part of your network doing nothing.
You may be confused as to how this zero-slope section works in the first place. Remember that a single step (in SGD, for example) involves multiple data points. As long as not all of them are negative, we can still get a slope out of ReLU. The dying problem is likely to occur when the learning rate is too high or there is a large negative bias.
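The batch argument above can be illustrated with a toy single-input neuron computing relu(w * x + b). This is a sketch under simplifying assumptions (one weight, the loss gradient taken as 1 for every sample); all names are hypothetical.

```python
from typing import Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Slope of ReLU: 1 for positive pre-activations, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def batch_weight_grad(w: float, b: float, inputs: Sequence[float]) -> float:
    """Sum over the batch of d relu(w*x + b) / dw for a one-input neuron."""
    return sum(relu_grad(w * x + b) * x for x in inputs)
```

With inputs [0.5, 1.0, 2.0] and w = 1, a bias of 0 lets every sample contribute, a bias of -0.75 silences only the first sample, and a large negative bias such as -10 leaves the gradient at zero for the whole batch, so the weight can no longer be updated.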
To resolve the dead neuron issues, the leaky ReLU was proposed.
1) Leaky ReLU (LReLU):
Leaky ReLU (LReLU) introduces a small slope for negative inputs to keep the weight updates alive during the entire propagation process. The alpha parameter was introduced as a solution to the ReLU's dead neuron problem, so that the gradient is never zero during training. The LReLU uses a very small constant slope α, typically around 0.01, for negative inputs. The LReLU AF is thus computed as:
f(x) = x if x > 0, and f(x) = αx if x ≤ 0
The LReLU gives results nearly identical to the standard ReLU, with the exception that it has non-zero gradients over the entire training duration. Reported results suggest no significant improvement except in sparsity and dispersion when compared to the standard ReLU and tanh functions.
Graphical representation of the Leaky ReLU function.
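A minimal sketch of the Leaky ReLU and its gradient, assuming a default slope of 0.01 for negative inputs (the function names are illustrative):

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Identity for positive inputs, a small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient is 1 for positive inputs and alpha otherwise, never 0 if alpha > 0."""
    return 1.0 if x > 0 else alpha
```

Setting `alpha=0.0` recovers the plain ReLU and `alpha=1.0` gives the identity, which matches the role of the alpha parameter described for ReLU implementations.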
2) Parametric Rectified Linear Units (PReLU):
The PReLU can be regarded as a variant of Leaky ReLU. The difference is that the slope of the negative part in PReLU is determined from the data; that is, the value of a is not a constant. In other words, the negative part of the PReLU function is adaptively learned, while the positive part is linear.
The PReLU is given by:
f(yᵢ) = yᵢ if yᵢ > 0, and f(yᵢ) = aᵢyᵢ if yᵢ ≤ 0
Above, yᵢ is any input on the ith channel and aᵢ is the slope of the negative part.
if aᵢ = 0, f becomes ReLU
if aᵢ is a small fixed positive constant, f becomes leaky ReLU
if aᵢ is a learnable parameter, f becomes PReLU
Graphical representation of the Leaky ReLU and Parametric ReLU functions.
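What "learnable" means for the slope can be sketched with a single gradient step. The example assumes one shared slope `a`, a plain SGD update, and an upstream gradient supplied by the caller; the helper `sgd_step_on_a` and the learning rate are hypothetical choices for illustration.

```python
from typing import Sequence


def prelu(y: float, a: float) -> float:
    """PReLU: identity for positive inputs, slope a for the rest."""
    return y if y > 0 else a * y


def prelu_grad_a(y: float) -> float:
    """Partial derivative of prelu(y, a) with respect to the slope a."""
    return 0.0 if y > 0 else y


def sgd_step_on_a(a: float, ys: Sequence[float],
                  upstream: Sequence[float], lr: float = 0.1) -> float:
    """One SGD update of a, given the loss gradient at each output."""
    grad = sum(g * prelu_grad_a(y) for y, g in zip(ys, upstream))
    return a - lr * grad
```

Only negative inputs contribute to the gradient of the slope, so the slope is shaped entirely by the samples that fall in the negative part.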
3) Randomized Leaky ReLU (RReLU):
RReLU is an activation function that randomly samples the negative slope for activation values. It was first proposed and used in the Kaggle NDSB Competition. During training, the slope applied to negative inputs is a random number sampled from a uniform distribution U(l, u).
The randomized ReLU is given by:
y_ji = x_ji if x_ji ≥ 0, and y_ji = a_ji · x_ji if x_ji < 0
where a_ji ∼ U(l, u), l < u, and l, u ∈ [0, 1].
Graphical representation of the Randomized ReLU.
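A sketch of the training-time behaviour, with the slope drawn from U(l, u) for each negative input. The evaluation rule shown (fixing the slope to the mean (l + u) / 2) follows the common description of RReLU rather than anything stated in this article, and the bounds used in any call are the caller's choice.

```python
import random


def rrelu_train(x: float, lower: float, upper: float,
                rng: random.Random) -> float:
    """Training mode: negative inputs get a freshly sampled random slope."""
    if x >= 0:
        return x
    return rng.uniform(lower, upper) * x


def rrelu_eval(x: float, lower: float, upper: float) -> float:
    """Evaluation mode (assumed): use the mean slope (lower + upper) / 2."""
    if x >= 0:
        return x
    return 0.5 * (lower + upper) * x
```

Passing an explicitly seeded `random.Random` keeps the sampled slopes reproducible across runs.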
4) S-shaped ReLU (SReLU):
The S-shaped Rectified Linear Unit, or SReLU, is an activation function for neural networks. It can learn both convex and non-convex functions, imitating the function forms given by two fundamental laws in psychophysics and neural sciences, namely the Weber-Fechner law and the Stevens law. Specifically, SReLU consists of three piecewise linear functions, which are formulated by four learnable parameters.
The SReLU is defined as a mapping:
f(x) = t_r + a_r(x - t_r) if x ≥ t_r, f(x) = x if t_l < x < t_r, and f(x) = t_l + a_l(x - t_l) if x ≤ t_l
where t_l, a_l, t_r, and a_r are the four learnable parameters (the two thresholds and the slopes beyond them).
The figure below shows the function forms of the Weber-Fechner law and the Stevens law along with the proposed SReLU. (a) shows the logarithm function. (b) shows the power function with different exponents. (c) and (d) are different forms of SReLU obtained by changing the parameters. The positive parts of (c) and (d) are derived by imitating the logarithm function (a) and the power function (b), respectively.
Graphical representation of the S-shaped ReLU.
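The three linear pieces and four parameters can be written out directly. This sketch assumes the usual SReLU parameterization with a left threshold and slope (t_l, a_l) and a right threshold and slope (t_r, a_r); the parameter values in any call are arbitrary illustrations, not learned values.

```python
def srelu(x: float, t_l: float, a_l: float, t_r: float, a_r: float) -> float:
    """Identity between the thresholds, separate linear pieces outside them."""
    if x >= t_r:
        return t_r + a_r * (x - t_r)   # right piece with slope a_r
    if x <= t_l:
        return t_l + a_l * (x - t_l)   # left piece with slope a_l
    return x                            # middle piece: identity
```

A right slope above 1 bends the curve upward (convex), a right slope below 1 flattens it (concave, log-like), and choosing t_l = 0 with a_l = 0 and a very large t_r recovers the ReLU.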
Softplus Function
The Softplus function is a primitive (antiderivative) of the sigmoid function, given by the relationship:
f(x) = log(1 + e^x)
The derivative of the softplus function is therefore the logistic function. ReLU and Softplus are largely similar, except near zero, where the softplus is smooth and differentiable. It is much easier and more efficient to compute ReLU and its derivative than the softplus function, which has log(.) and exp(.) in its formulation.
The Softplus function has the range (0, ∞).
Graphical representation of the Softplus function.
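A small sketch of the softplus, assuming the form log(1 + e^x). The rewritten expression max(x, 0) + log1p(e^(-|x|)) is algebraically the same but avoids overflow for large inputs; the function names are illustrative.

```python
import math


def softplus(x: float) -> float:
    """log(1 + e^x), computed in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Logistic sigmoid, the derivative of the softplus."""
    return 1.0 / (1.0 + math.exp(-x))
```

A central finite difference of `softplus` reproduces `sigmoid`, and for inputs around 10 the softplus is already within about 0.0001 of the ReLU output.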
Exponential Linear Units (ELUs)
Exponential Linear Units are activation functions used to speed up the training of deep neural networks. The main advantage of ELUs is that they can alleviate the vanishing gradient problem by using the identity for positive values, which also improves the learning characteristics. They take negative values, which pushes the mean unit activation closer to zero, thereby reducing computational complexity and improving learning speed. The ELU is a good alternative to the ReLU, as it decreases bias shifts by pushing the mean activation towards zero during training.
The exponential linear unit (ELU) with α > 0 is:
f(x) = x if x > 0, and f(x) = α(e^x - 1) if x ≤ 0
where α is the ELU hyperparameter that controls the saturation point for negative net inputs; it is usually set to 1.0.
The ELU has the range (-α, +∞).
ELUs have a clear saturation plateau in their negative regime, which helps them learn more robust representations. They have been reported to offer faster learning and better generalization than ReLU and LReLU for specific network structures, especially those with more than five layers, and to give state-of-the-art results compared to ReLU variants. However, a critical limitation of the ELU is that it does not center the values exactly at zero, and the parametric ELU was proposed to address this issue.
Graphical representation of Exponential Linear Units.
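The effect on the mean activation can be checked on a toy, symmetric set of inputs. The sketch assumes the form x for positive inputs and α(e^x - 1) otherwise; the helper `mean_activation` is hypothetical.

```python
import math
from typing import Callable, Sequence


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def relu(x: float) -> float:
    return max(0.0, x)


def mean_activation(f: Callable[[float], float], xs: Sequence[float]) -> float:
    """Average output of f over a set of inputs."""
    return sum(f(x) for x in xs) / len(xs)
```

For the inputs [-2, -1, 0, 1, 2] the ReLU mean is 0.6, while the ELU mean is roughly half of that, because the negative outputs offset part of the positive ones.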
1) Parametric Exponential Linear Unit (PELU):
The parametric ELU is a parameterized version of the exponential linear unit (ELU) that tries to address the zero-center issue found in ELUs. The PELU was proposed by Trottier et al., 2017, who use it in the context of vanishing gradients to provide a gradient-based optimization framework that reduces bias shifts while maintaining the zero center of values.
The PELU has two additional parameters over the ELU:
f(x) = cx if x > 0, and f(x) = α(e^(x/b) - 1) if x ≤ 0
where α, b, and c > 0. Here c changes the slope in the positive quadrant, b controls the scale of the exponential decay, and α controls the saturation in the negative quadrant.
Graphical representation of the Parametric Exponential Linear Unit.
The PELU promises to be a good option for applications, such as CNNs, that need reduced bias shift and less exposure to vanishing gradients.
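A sketch of the PELU with saturation α and decay scale b. It assumes the positive slope is tied to the other two parameters as c = α / b, which is the condition under which the two pieces have matching slopes at zero; treat that tie, and the function name, as assumptions of this example.

```python
import math


def pelu(x: float, alpha: float = 1.0, b: float = 1.0) -> float:
    """PELU with positive slope c = alpha / b (assumed tie, see lead-in)."""
    c = alpha / b
    if x > 0:
        return c * x
    return alpha * math.expm1(x / b)   # saturates towards -alpha
```

With `alpha = 1` and `b = 1` this reduces to the ELU with α = 1, and in a network both parameters would be learned alongside the weights.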
2) Scaled Exponential Linear Units (SELU):
The scaled exponential linear unit (SELU) is another variant of the ELU, proposed by Klambauer et al., 2017. The SELU was introduced for self-normalizing neural networks and has the peculiar property of inducing self-normalization: activations with close to zero mean and unit variance converge towards zero mean and unit variance when propagated through multiple layers during network training. This makes it suitable for deep learning applications, and with strong regularization it learns robust features efficiently.
The SELU is given by:
f(x) = λx if x > 0, and f(x) = λα(e^x - 1) if x ≤ 0
λ ≈ 1.0507, α ≈ 1.6732
SELU has values of alpha α and lambda λ predefined.
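The self-normalizing property can be probed empirically. The sketch below uses the rounded constants quoted above, assumes the form λx for positive inputs and λα(e^x - 1) otherwise, and feeds standard-normal samples through the function; the helper `selu_moments` and the sample count are choices made for this illustration.

```python
import math
import random
from typing import Tuple

SELU_LAMBDA = 1.0507  # rounded value quoted in the article
SELU_ALPHA = 1.6732   # rounded value quoted in the article


def selu(x: float) -> float:
    """Scaled ELU with the predefined lambda and alpha."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * math.expm1(x)


def selu_moments(n_samples: int, seed: int = 0) -> Tuple[float, float]:
    """Mean and variance of selu(z) for z drawn from N(0, 1)."""
    rng = random.Random(seed)
    ys = [selu(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    mean = sum(ys) / n_samples
    var = sum((y - mean) ** 2 for y in ys) / n_samples
    return mean, var
```

For standard-normal inputs the output mean stays close to 0 and the variance close to 1, which is the fixed point that the two constants are chosen to produce.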
Here’s the main advantage of SELU over ReLU:
Internal normalization is faster than external normalization, which means the network converges faster.
SELU is a relatively new activation function, and more work is needed on architectures such as CNNs and RNNs, where it is comparatively unexplored.
Graphical representation of the Scaled Exponential Linear Unit (SELU) activation.
SELUs are reported not to be affected by vanishing and exploding gradient problems, and it has been observed that they allow the construction of mappings with properties leading to self-normalizing neural networks, which cannot be derived using ReLU, scaled ReLU, sigmoid, LReLU, or even tanh functions.
Maxout Function
The Maxout activation function applies its non-linearity as the maximum over dot products between the weights of a neural network and the data. The Maxout generalizes the ReLU and the leaky ReLU: the neuron inherits their properties, while no dying neurons or saturation exist in the network computation. The Maxout function is given by:
f(x) = max(w₁ᵀx + b₁, w₂ᵀx + b₂)
Where w = weights, b = biases, T = transpose.
The Maxout function has been tested successfully in phone recognition applications.
The main drawback of Maxout is that it is computationally expensive as it doubles the number of parameters for each neuron.
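A minimal sketch of a Maxout unit as the maximum over several affine pieces w_k · x + b_k. It is written for any number of pieces, of which the two-piece case is the one discussed here; the function name and the weight values in the usage note are illustrative.

```python
from typing import Sequence


def maxout(x: Sequence[float],
           weights: Sequence[Sequence[float]],
           biases: Sequence[float]) -> float:
    """Return the largest of the affine pieces w_k . x + b_k."""
    return max(
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )
```

For a scalar input, the weights [[1.0], [0.0]] with zero biases reproduce the ReLU, and [[1.0], [-1.0]] reproduce the absolute value. Each extra piece carries its own weight vector and bias, which is where the additional parameter cost comes from.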
Depiction of how the Maxout activation function can implement the ReLU and absolute value functions and approximate the quadratic function. A Maxout unit can approximate arbitrary convex functions.
Swish Function
The Swish activation function is one of the first compound AFs, proposed as a combination of the sigmoid AF and the input function to achieve a hybrid AF. It was discovered using a reinforcement learning-based automatic search technique. The properties of the Swish function include smoothness, non-monotonicity, being bounded below, and being unbounded above. The smoothness property helps the Swish function produce better optimization and generalization results when used in training deep learning architectures.
The Swish function is given by:
f (x) = x · sigmoid(x)
The main advantages of the Swish function are its simplicity and improved accuracy: the Swish does not suffer from vanishing gradient problems and provides good information propagation during training. Its authors reported that the Swish AF outperformed the ReLU activation function on deep learning classification tasks.
Graphical representation of the Swish activation function.
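The listed properties (non-monotonic, bounded below, unbounded above) can be checked numerically for f(x) = x · sigmoid(x). The grid search in `min_on_grid` is a hypothetical helper used only for this illustration.

```python
import math


def swish(x: float) -> float:
    """Swish: x * sigmoid(x), written as x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))


def min_on_grid(lo: float = -20.0, hi: float = 0.0, steps: int = 2000) -> float:
    """Smallest value of swish on an evenly spaced grid over [lo, hi]."""
    return min(swish(lo + (hi - lo) * i / steps) for i in range(steps + 1))
```

The function dips to a minimum of roughly -0.28 a little below x = -1 and then rises back towards zero as x becomes more negative, which is the non-monotonic bump visible in plots of Swish.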
Gaussian Error Linear Unit (GELU)
The Gaussian Error Linear Unit (GELU) activation function is used in BERT, RoBERTa, ALBERT, and other top NLP models. This activation function is motivated by combining properties of dropout, zoneout, and ReLUs.
ReLU and dropout both yield a neuron's output by multiplying the input by zero or one. ReLU does so deterministically (by zero if the input is negative and by one if it is positive), while dropout stochastically multiplies by zero.
The RNN regularizer called zoneout stochastically multiplies inputs by one.
GELU merges this functionality by multiplying the input by either zero or one, where the choice is stochastic and depends on the input. The neuron input x is multiplied by:
m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) and X ∼ N(0, 1)
Taking the expected value of this stochastic product gives the deterministic activation GELU(x) = x · Φ(x).
The GELU authors report that this nonlinearity matches or exceeds ReLU and ELU activations, with performance improvements across the tasks they considered in computer vision, natural language processing, and speech recognition.
Graphical representation of the Gaussian Error Linear Unit.
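The link between the stochastic gate and the deterministic activation can be sketched as follows. The code assumes the gate is a Bernoulli draw with probability Φ(x), the standard normal CDF, so that the expected output is x · Φ(x); the Monte Carlo helper `gated_mean` is hypothetical and exists only to show that the average of the gated input approaches that value.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    """Phi(x) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Deterministic GELU: the input scaled by Phi(x)."""
    return x * std_normal_cdf(x)


def gated_mean(x: float, n_samples: int, seed: int = 0) -> float:
    """Average of x * m with m drawn from Bernoulli(Phi(x))."""
    rng = random.Random(seed)
    p = std_normal_cdf(x)
    kept = sum(1 for _ in range(n_samples) if rng.random() < p)
    return x * kept / n_samples
```

Large positive inputs are almost always kept, large negative inputs are almost always zeroed, and inputs near zero are gated roughly half the time.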
Exponential Linear Squashing (ELiSH)
The ELiSH shares common properties with the Swish function. The ELiSH function is made up of the ELU and Sigmoid functions, and it is given by:
f(x) = x / (1 + e^(-x)) if x ≥ 0, and f(x) = (e^x - 1) / (1 + e^(-x)) if x < 0
The properties of the ELiSH function differ between the negative and positive parts, as defined by the limits. The Sigmoid part of the ELiSH function improves information flow, while the linear part eliminates the vanishing gradient issue. The ELiSH function has been applied successfully on the ImageNet dataset using different deep convolutional architectures.
Graphical representation of ELiSH.
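A sketch of the ELiSH as a sigmoid multiplied by the input on the positive side and by e^x - 1 (the ELU's negative branch with α = 1) on the negative side. The function name is illustrative.

```python
import math


def elish(x: float) -> float:
    """ELiSH: sigmoid(x) times x for x >= 0, times (e^x - 1) for x < 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    if x >= 0:
        return x * s
    return math.expm1(x) * s
```

On the positive side this coincides with Swish, and on the negative side the output stays small and negative, since it is the product of a value in (-1, 0) and a sigmoid below 0.5.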
1) HardELiSH:
The HardELiSH is the hard variant of the ELiSH activation function. It is a multiplication of the HardSigmoid and the ELU in the negative part, and a multiplication of the linear function and the HardSigmoid in the positive part. The HardELiSH is given by:
f(x) = x · HardSigmoid(x) if x ≥ 0, and f(x) = (e^x - 1) · HardSigmoid(x) if x < 0
The HardELiSH function was tested on the ImageNet classification dataset.
Graphical representation of HardELiSH.
Conclusion
The AF is a key component in the training and optimization of neural networks. It is implemented on different layers of DL architectures and is used across domains including natural language processing, object detection, classification, and segmentation.
Among all the AFs discussed above, ReLU and its variants should generally be preferred over the sigmoid or tanh activation functions, and ReLUs are also faster to train. If ReLU is causing dead neurons, use Leaky ReLU or one of its other variants. Sigmoid and tanh suffer from vanishing gradient problems and are best avoided in hidden layers, where ReLUs are usually the better choice. In general, activation functions that are easily differentiable and easy to train should be used.