By Nagesh Singh Chauhan

Loss Functions in Neural Networks

Updated: Sep 20, 2021

This article gives a brief overview of the various loss functions used in neural networks.

What is a Loss function?


When you train Deep learning models, you feed data to the network, generate predictions, compare them with the actual values (the targets) and then compute what is known as a loss. This loss essentially tells you something about the performance of the network: the higher it is, the worse your network performs overall.


Loss functions are mainly classified into two categories: classification loss and regression loss. Classification loss is used when the aim is to predict one of several categorical values. For example, if we have a dataset of handwritten digit images and the digit (0–9) is to be predicted, a classification loss is used.


If the problem is regression, i.e. predicting continuous values such as weather conditions or house prices on the basis of some features, a regression loss is used.


In this article, we will focus on the most widely used loss functions in Neural networks.

  • Mean Absolute Error (L1 Loss)

  • Mean Squared Error (L2 Loss)

  • Huber Loss

  • Cross-Entropy (a.k.a. Log Loss)

  • Relative Entropy (a.k.a. Kullback–Leibler Divergence)

  • Squared Hinge


Mean Absolute Error (MAE)


Mean absolute error (MAE), also called L1 loss, is a loss function used for regression problems. It is computed by averaging the absolute difference between the original and predicted values over the data set.
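In standard notation, with y_i the true value and ŷ_i the prediction for the i-th of n examples:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$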

MAE is not sensitive to outliers: given several examples with the same input feature values, the optimal prediction will be their median target value. This should be compared with Mean Squared Error, where the optimal prediction is the mean. A disadvantage of MAE is that the gradient magnitude does not depend on the error size, only on the sign of y − ŷ, which means the gradient magnitude is large even when the error is small; this in turn can lead to convergence problems.


When to use it?


Use Mean absolute error when you are doing regression and don’t want outliers to play a big role. It can also be useful if you know that your distribution is multimodal, and it’s desirable to have predictions at one of the modes, rather than at the mean of them.


Example: When doing image reconstruction, MAE encourages less blurry images compared to MSE. This is used for example in the paper Image-to-Image Translation with Conditional Adversarial Networks by Isola et al.


Mean Squared Error (MSE)


Mean Squared Error (MSE), also called L2 loss, is another loss function used for regression. It is computed by averaging the squared difference between the original and predicted values over the data set.
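With the same notation as for MAE:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$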

MSE is sensitive to outliers: given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus a good choice if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it is important to penalize outliers heavily.

When to use it?


Use MSE when doing regression, when you believe that the target, conditioned on the input, is normally distributed, and when you want large errors to be penalized significantly (quadratically) more than small ones.


Example: You want to predict future house prices. The price is a continuous value, and therefore we want to do regression. MSE can here be used as the loss function.


Calculate MAE and MSE using Python


The original target values are denoted by y and the predicted values by ŷ (yhat); these are the two quantities needed to evaluate the model.


import numpy as np
import matplotlib.pyplot as plt

# original targets (y) and predictions (yhat)
y = np.array([-3, -1, -2, 1, -1, 1, 2, 1, 3, 4, 3, 5])
yhat = np.array([-2, 1, -1, 0, -1, 1, 2, 2, 3, 3, 3, 5])
x = list(range(len(y)))

# We can visualize them in a plot to check the difference visually.
plt.figure(figsize=(9, 5))
plt.scatter(x, y, color="red", label="original")
plt.plot(x, yhat, color="green", label="predicted")
plt.legend()
plt.show()


# calculate MSE
d = y - yhat
mse_f = np.mean(d**2)
print("Mean square error:",mse_f)

Mean square error: 0.75


# calculate MAE
mae_f = np.mean(abs(d))
print("Mean absolute error:",mae_f)

Mean absolute error: 0.5833333333333334


Huber Loss


Huber loss is typically used in regression problems. It is less sensitive to outliers than MSE because it treats the error as squared only inside an interval.


Consider an example where we have a dataset of 100 values we would like our model to be trained to predict. Out of all that data, 25% of the expected values are 5 while the other 75% are 10.


An MSE loss wouldn’t quite do the trick, since we don’t really have “outliers”; 25% is by no means a small fraction. On the other hand, we don’t necessarily want to weigh that 25% too low with an MAE. Those values of 5 aren’t close to the median (10 — since 75% of the points have a value of 10), but they’re also not really outliers.


This is where the Huber Loss Function comes into play.


The Huber Loss offers the best of both worlds by balancing the MSE and MAE together. We can define it using the following piecewise function:
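Writing a = y − ŷ for the residual:

$$L_{\delta}(a) = \begin{cases} \frac{1}{2}a^{2} & \text{for } |a| \le \delta \\ \delta\left(|a| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$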

Here, delta (𝛿) is a hyperparameter that defines the boundary between the MSE and MAE regions.


In simple terms, what the above says is: for residuals smaller than delta (𝛿), use the MSE; for residuals larger than delta, use the MAE. This way, Huber loss provides the best of both MAE and MSE.


Set delta to the value of the residual for the data points you trust.



import numpy as np
import matplotlib.pyplot as plt

def huber(a, delta):
  # quadratic (MSE-like) inside |a| < delta, linear (MAE-like) outside
  value = np.where(np.abs(a) < delta, .5*a**2, delta*(np.abs(a) - .5*delta))
  deriv = np.where(np.abs(a) < delta, a, np.sign(a)*delta)
  return value, deriv

a = np.arange(-1, 1, .01)   # residuals y - yhat
h, d = huber(a, delta=0.2)

fig, ax = plt.subplots(1)
ax.plot(a, h, label='loss value')
ax.plot(a, d, label='loss derivative')
ax.grid(True)
ax.legend()
plt.show()

In the figure above, you can see that the derivative is constant for abs(a) > delta.


In TensorFlow 2 and Keras, Huber loss can be added to the compile step of your model.


model.compile(loss=tensorflow.keras.losses.Huber(delta=1.5), optimizer='adam', metrics=['mean_absolute_error'])

When to use Huber Loss?


As we already know, Huber loss combines MAE and MSE. So when you do not want outliers to be given a higher weight, set your loss function to Huber loss. The delta (𝛿) value has to be defined manually; generally, a few iterations with the chosen algorithm are needed to find the right delta value.


Cross-Entropy Loss (a.k.a. Log Loss)


The concept of cross-entropy traces back to the field of Information Theory, where Claude Shannon introduced the concept of entropy in 1948. Before diving into the cross-entropy loss function, let us talk about entropy.


Entropy has roots in physics — it is a measure of disorder, or unpredictability, in a system.


For instance, consider two gases in a box: initially, the system has low entropy, in that the two gases are completely separated (a skewed distribution); after some time, however, the gases blend (a distribution where events have equal probability), so the system's entropy increases. It is said that in an isolated system the entropy never decreases; the chaos never dims down without external influence.



For a random variable X with probability distribution p(x), entropy is defined as follows:
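$$H(X) = -\sum_{x} p(x)\,\log p(x)$$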

Reason for the negative sign: log(p(x)) < 0 for all p(x) in (0,1), since p(x) is a probability and its values must range between 0 and 1; the minus sign therefore makes the entropy non-negative.



Cross-entropy loss is also called logarithmic loss, log loss, or logistic loss. Each predicted class probability is compared to the actual class's desired output (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value. The penalty is logarithmic in nature, yielding a large score for large differences close to 1 and a small score for small differences tending to 0.


Cross-Entropy is expressed by the equation;
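$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$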

Where x ranges over the possible outcomes, p(x) is the probability distribution of the "true" labels from the training samples, and q(x) is the distribution estimated by the ML algorithm.


Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
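For a single example with true label 1, the penalty is −log(ŷ): predicting ŷ = 0.012 gives −log(0.012) ≈ 4.42, while predicting ŷ = 0.98 gives −log(0.98) ≈ 0.02.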


Plotting the loss against the predicted probability for a true observation shows that as the predicted probability approaches 1, the log loss slowly decreases; as the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially predictions that are confident and wrong!

As an aside, the cross-entropy method is also the name of a Monte Carlo technique for importance sampling and optimization.


Binary Cross-Entropy


Binary cross-entropy is a loss function that is used in binary classification tasks. These are tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right).

In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as:
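$$-\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$

where y_i is the true label (0 or 1) and ŷ_i is the predicted probability for the i-th of N examples.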

Sigmoid is the only activation function compatible with the binary cross-entropy loss function. You must use it on the last layer of the network, just before the output.


The binary cross-entropy needs to compute the logarithms of Ŷi and (1-Ŷi), which only exist if Ŷi is between 0 and 1. The sigmoid activation function guarantees that the output lies within this range.


Categorical Cross-Entropy


Categorical cross-entropy is a loss function that is used in multi-class classification tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one.

Formally, it is designed to quantify the difference between two probability distributions.

If 𝑀>2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.
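$$-\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c})$$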

  • M — number of classes (dog, cat, fish)

  • log — the natural log

  • y — binary indicator (0 or 1) if class label c is the correct classification for observation o

  • p — predicted probability observation o is of class 𝑐


Softmax is the only activation function recommended for use with the categorical cross-entropy loss function.


Strictly speaking, the output of the model only needs to be positive so that the logarithm of every output value Ŷi​ exists. However, the main appeal of this loss function is for comparing two probability distributions. The softmax activation rescales the model output so that it has the right properties.


Sparse Categorical Cross-Entropy


Sparse categorical cross-entropy uses the same loss function as categorical cross-entropy, described above. The only difference is the format in which the true labels Yi are specified.


If your Yi’s are one-hot encoded, use categorical_crossentropy. Examples for a 3-class classification: [1,0,0] , [0,1,0], [0,0,1]


But if your Yi's are integers (class indices), use sparse_categorical_crossentropy. Examples for the above 3-class classification problem: [0], [1], [2]


The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross-entropy is that it saves memory as well as computation time, because it uses a single integer for a class rather than a whole one-hot vector.
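To make the difference concrete, here is a minimal sketch (assuming TensorFlow 2 / tf.keras; the toy labels and probabilities below are made up for illustration) showing that both losses return the same value when given the matching label format:

import numpy as np
import tensorflow as tf

# integer class indices for a 3-class problem
y_true_int = np.array([0, 1, 2])
# the same labels, one-hot encoded
y_true_onehot = tf.keras.utils.to_categorical(y_true_int, num_classes=3)

# predicted class probabilities from some model
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7]])

cce = tf.keras.losses.CategoricalCrossentropy()         # expects one-hot targets
scce = tf.keras.losses.SparseCategoricalCrossentropy()  # expects integer targets

print(cce(y_true_onehot, y_pred).numpy())   # average loss over the 3 examples
print(scce(y_true_int, y_pred).numpy())     # same value, different label format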


Calculate Cross-Entropy Between Class Labels and Probabilities


The use of cross-entropy for classification often gives different specific names based on the number of classes.


Consider a two-class classification task with the following 10 actual class labels (P) and predicted probabilities (Q).


# calculate cross entropy for classification problem
from math import log
from numpy import mean

# calculate cross entropy
def cross_entropy_funct(p, q):
    return -sum([p[i]*log(q[i]) for i in range(len(p))])

# define classification data p and q
p = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
q = [0.7, 0.9, 0.8, 0.8, 0.6, 0.2, 0.1, 0.4, 0.1, 0.3]

# calculate cross entropy for each example
results = list()
for i in range(len(p)):
    # create the distribution for each event {0, 1}
    expected = [1.0 - p[i], p[i]]
    predicted = [1.0 - q[i], q[i]]
    # calculate cross entropy for the two events
    cross = cross_entropy_funct(expected, predicted)
    print('>[y=%.1f, yhat=%.1f] cross entropy: %.3f' % (p[i], q[i], cross))
    results.append(cross)

# calculate the average cross entropy
mean_cross_entropy = mean(results)
print('\nAverage Cross Entropy: %.3f' % mean_cross_entropy)

Running the example prints the actual label and predicted probability for each example, followed by the average cross-entropy loss across all examples, in this case 0.272.


>[y=1.0, yhat=0.7] cross entropy: 0.357
>[y=1.0, yhat=0.9] cross entropy: 0.105
>[y=1.0, yhat=0.8] cross entropy: 0.223
>[y=1.0, yhat=0.8] cross entropy: 0.223
>[y=1.0, yhat=0.6] cross entropy: 0.511
>[y=0.0, yhat=0.2] cross entropy: 0.223
>[y=0.0, yhat=0.1] cross entropy: 0.105
>[y=0.0, yhat=0.4] cross entropy: 0.511
>[y=0.0, yhat=0.1] cross entropy: 0.105
>[y=0.0, yhat=0.3] cross entropy: 0.357


Average Cross Entropy: 0.272


Relative Entropy (Kullback–Leibler Divergence)


Relative entropy (also called Kullback–Leibler divergence) is a method for measuring the similarity between two probability distributions. It was introduced by Solomon Kullback and Richard Leibler in 1951. KL divergence aims to quantify the divergence (separation) of a probability distribution from a baseline distribution. That is, for a target distribution P, we compare a competing distribution Q by computing the expected value of the log-ratio of the two distributions:
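$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$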


For distributions P and Q of a continuous random variable, the Kullback-Leibler divergence is computed as an integral:
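$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x)\,\log\frac{p(x)}{q(x)}\,dx$$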


If P and Q represent the probability distribution of a discrete random variable, the Kullback-Leibler divergence is calculated as a summation:
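$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}$$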


Also, with a little bit of work, we can show that the KL divergence is non-negative. This means the smallest possible value is zero (the distributions are equal) and the maximum value is infinity. The divergence is infinite when P places probability mass in a region where Q is zero. Therefore, it is a common assumption that both distributions share the same support.


The closer two distributions get to each other, the lower the loss becomes. In the following graph, the blue distribution is trying to model the green distribution. As the blue distribution comes closer and closer to the green one, the KL divergence loss will get closer to zero.


The lower the KL divergence value, the better we have matched the true distribution with our approximation.

Comparison of the blue and green distributions


The applications of KL-Divergence:

  1. Primarily, it is used in Variational Autoencoders. These autoencoders learn to encode samples into a latent probability distribution and from this latent distribution, a sample can be drawn that can be fed to a decoder which outputs e.g. an image.

  2. KL divergence can also be used in multiclass classification scenarios. These problems, which traditionally use the Softmax function and use one-hot encoded target data, are naturally suitable to KL divergence since Softmax “normalizes data into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers”

  3. Delineating the relative (Shannon) entropy in information systems,

  4. Randomness in continuous time-series.


Calculate KL-Divergence using Python


Consider a random variable with six events as different colors. We may have two different probability distributions for this variable; for example:

import numpy as np
import matplotlib.pyplot as plt

events = ['red', 'green', 'blue', 'black', 'yellow', 'orange']
p = [0.10, 0.30, 0.05, 0.90, 0.65, 0.21]
q = [0.70, 0.55, 0.15, 0.04, 0.25, 0.45]

Plot a histogram for each probability distribution, allowing the probabilities for each event to be directly compared.

# plot first distribution
plt.figure(figsize=(9, 5))
plt.subplot(2,1,1)
plt.bar(events, p, color ='green',align='center')
# plot second distribution
plt.subplot(2,1,2)
plt.bar(events, q,color ='green',align='center')
# show the plot
plt.show()

We can see that indeed the distributions are different.


Next, we can develop a function to calculate the KL divergence between the two distributions.

# KL divergence using the natural log, so the result is in nats
def kl_divergence(p, q):
    return sum(p[i] * np.log(p[i]/q[i]) for i in range(len(p)))

# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f nats' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f nats' % kl_qp)

KL(P || Q): 2.832 nats

KL(Q || P): 1.840 nats


Alternatively, we can calculate the KL divergence using the rel_entr() SciPy function and confirm that our manual calculation is correct.


The rel_entr() function takes lists of probabilities across all events from each probability distribution as arguments and returns a list of divergences for each event. These can be summed to give the KL divergence.


from scipy.special import rel_entr

print("Using Scipy rel_entr function")

bo_1 = np.array(p)
bo_2 = np.array(q)

print('KL(P || Q): %.3f nats' % sum(rel_entr(bo_1,bo_2)))
print('KL(Q || P): %.3f nats' % sum(rel_entr(bo_2,bo_1)))

Using Scipy rel_entr function

KL(P || Q): 2.832 nats

KL(Q || P): 1.840 nats


Let us see how KL divergence can be used with Keras. It's pretty simple: it just involves specifying it as the loss function during the model compilation step:


# Compile the model
model.compile(loss=keras.losses.kullback_leibler_divergence,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

Squared Hinge


The squared hinge loss is a loss function used for “maximum margin” binary classification problems. Mathematically it is defined as:
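$$L(y, \hat{y}) = \big(\max(0,\; 1 - y\cdot\hat{y})\big)^{2}$$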

where ŷ is the predicted value and y is either 1 or -1.


Thus, the squared hinge loss → 0 when the true and predicted labels have the same sign and y·ŷ ≥ 1 (an indication that the classifier is sure it is the correct label).


The squared hinge loss increases quadratically with the error when the true and predicted labels are not the same, or when y·ŷ < 1 even though the labels agree (an indication that the classifier is not sure it is the correct label).


Compared to the traditional hinge loss (used in SVMs), larger errors are punished more severely, whereas smaller errors are punished more lightly.
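A quick NumPy sketch (illustrative only; the prediction values are made up) shows this for a positive example (y = 1):

import numpy as np

def hinge(y, y_hat):
    return np.maximum(0.0, 1.0 - y * y_hat)

def squared_hinge(y, y_hat):
    return np.maximum(0.0, 1.0 - y * y_hat) ** 2

y = 1.0                                       # true label
y_hat = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # example predictions

print(hinge(y, y_hat))          # [2.   1.   0.5  0.   0.  ]
print(squared_hinge(y, y_hat))  # [4.   1.   0.25 0.   0.  ]

The large error (y_hat = -1) is punished twice as hard by the squared hinge (4 vs 2), while the small error (y_hat = 0.5) is punished more lightly (0.25 vs 0.5).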


When to use Squared Hinge?


Use the squared hinge loss function on problems involving yes/no (binary) decisions, especially when you are not interested in knowing how certain the classifier is about the classification, i.e. when you don't care about the classification probabilities. Use it in combination with the tanh() activation function in the last layer of the neural network.


A typical application can be classifying email into ‘spam’ and ‘not spam’ and you’re only interested in the classification accuracy.


Let us see how squared hinge can be used with Keras. It's pretty simple: it just involves specifying it as the loss function during the model compilation step:


# Compile the model
model.compile(loss='squared_hinge', optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.03), metrics=['accuracy'])


Feel free to connect with me on LinkedIn for any queries.


Thank you for reading this article, I hope you have found it useful.

