Let the Machines Learn

Which activation function to use in neural networks?

Activation functions are an integral component of neural networks. There are a number of common activation functions, so it often gets confusing which one is best suited for a particular task.

In this blog post I will talk about why we need activation functions, the common output layer and hidden layer activation functions, best practices for choosing among them, and some recent developments.

Why do we need activation functions?

Why do we use non-linear activation functions? Couldn’t we just multiply the input with the weight values, add a bias and propagate the result forward? The reason we don’t is that no matter how many layers we add, the final output would still be a linear function of the input (the sketch below shows this collapse concretely). It is because of these non-linear activation functions that neural networks are considered universal function approximators. Adding non-linearity to the network allows it to approximate any possible function (linear or non-linear).
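Here is a minimal NumPy sketch of that collapse (the weight matrices and input are arbitrary examples, not from any particular network): two layers with no activation function are exactly equivalent to one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into a single linear layer: y = W @ x + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer_out = W @ x + b

print(np.allclose(two_layer_out, one_layer_out))  # True
```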

To get a better understanding of this property, you should definitely check out this awesome post by Michael Nielsen – A Visual Proof that neural nets can compute any function.

Output Layer Activation Functions

It is very important to understand that the output layer activation functions are different from the hidden layer activation functions. The output layer has a very specific objective – to produce values that match the true labels as closely as possible.

We need to carefully select the final layer activation depending on the task at hand (regression, single-label classification, multi-label classification, etc.).

Softmax Activation

The softmax function takes as input a K-dimensional vector of real values, z, and squashes it to a K-dimensional vector f(z) of real values in the range (0, 1) that add up to one. The function is given by,

f(z)_j = e^(z_j) ∕ Σ_k e^(z_k), for j = 1, …, K

The softmax is a popular choice for the output layer activation.
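As a sanity check, here is a minimal NumPy sketch of the softmax defined above. Subtracting the maximum before exponentiating is a standard numerical-stability trick I have added, not part of the definition itself.

```python
import numpy as np

def softmax(z):
    """Squash a K-dimensional real vector into probabilities that sum to 1."""
    z = z - np.max(z)          # shift for numerical stability (exp of large z overflows)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))         # ~[0.659 0.242 0.099]
print(softmax(scores).sum())   # 1.0 (up to floating point)
```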

When not to use softmax activations?

The softmax function should not be used for multi-label classification. Unlike single-label classification, where the target is one-hot encoded, a multi-label example can have more than one true label (for example, a dog and a bone). The softmax function simply can’t produce more than one output value close to 1. Therefore, the sigmoid function (discussed later) is preferred for multi-label classification.

The softmax function should not be used for regression tasks either. Simple linear units, f(x) = x, should be used instead.

Now, let’s discuss some of the popular hidden layer activation functions and then decide which one should be preferred.

Sigmoid Activation

The sigmoid function has the mathematical formula –

σ(x) = 1 ∕ (1 + e^(-x))

This function takes a real number as input and squashes it into the range (0, 1). Earlier it was widely used as it has a nice interpretation as the firing rate of a neuron: it maps large negative numbers to 0 (not firing) and large positive numbers to 1 (fully firing).

The sigmoid function is rarely used in hidden layers any more. It has two major drawbacks –

1. Saturation kills gradients: for large positive or negative inputs the curve flattens out, the gradient becomes nearly zero, and very little signal flows back through the neuron during backpropagation (illustrated in the sketch below).
2. Its outputs are not zero-centered: since the outputs are always positive, the gradients on the weights of the next layer are all positive or all negative, which makes gradient descent less efficient.
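Here is a small NumPy sketch of the sigmoid and its derivative; it shows how the gradient collapses towards zero for large positive or negative inputs, which is the saturation problem listed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # derivative of the sigmoid

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))              # ~[0.00005, 0.119, 0.5, 0.881, 0.99995]
print(sigmoid_grad(x))         # gradients vanish at the tails: ~[0.00005, 0.105, 0.25, 0.105, 0.00005]
```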

Tanh Activation

The tanh function has the mathematical formula –

tanh(x) = 2σ(2x) – 1, where σ(x) is the sigmoid function.

It takes a real value as input and squashes it in the range (-1, 1). 
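A quick NumPy check of the identity above against the library tanh:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_sigmoid(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0   # tanh(x) = 2*sigmoid(2x) - 1

x = np.linspace(-3, 3, 7)
print(np.allclose(tanh_via_sigmoid(x), np.tanh(x)))  # True
```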

ReLU Activation

The ReLU is the most popular and commonly used activation function. It can be represented as –

f(x) = max(0, x)

It takes a real value as input. The output is x when x > 0, and 0 otherwise. It is mostly preferred over sigmoid and tanh.
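A minimal NumPy sketch of the ReLU:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)    # f(x) = max(0, x), applied element-wise

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                 # [0.  0.  0.  0.5 3. ]
```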

Advantages

- It is very cheap to compute – just a threshold at zero, with no exponentials.
- It does not saturate in the positive region, so gradients do not vanish there.
- In practice it greatly accelerates the convergence of gradient descent compared to sigmoid and tanh.

Disadvantages

- The output is not zero-centered.
- “Dying ReLU”: if a neuron’s input stays negative, its gradient is zero and the neuron may stop updating permanently. A large learning rate makes this more likely.

Leaky ReLU Activation

f(x) = 1(x < 0)(αx) + 1(x ≥ 0)(x), where α is a small constant

Leaky ReLUs attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU gives a small negative slope (of 0.01, or so). Some people report success with this form of activation function, but the results are not always consistent.
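A minimal NumPy sketch of the leaky ReLU, using the commonly quoted slope of 0.01 for the negative region:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x >= 0, alpha * x for x < 0
    return np.where(x >= 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))           # [-0.03  -0.005  0.  0.5  3. ]
```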

Parametric ReLU Activation

The PReLU function is given by,

f(x) = max(αx, x), where α is a learnable parameter.

This gives the neurons the ability to choose what slope is best in the negative region. They can become a ReLU or a leaky ReLU with certain values of α.
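Here is a rough NumPy sketch of a PReLU forward pass and of the gradient with respect to α, assuming a single α shared across the layer (the helper names are illustrative). In practice, frameworks learn α by backpropagation along with the weights.

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x >= 0, x, alpha * x)

def prelu_grad_alpha(x, alpha, upstream_grad):
    # df/dalpha is x where x < 0 and 0 elsewhere, so alpha can be updated
    # by backpropagation like any other parameter.
    return np.sum(upstream_grad * np.where(x < 0, x, 0.0))

x = np.array([-2.0, -0.5, 1.0, 3.0])
alpha = 0.25
print(prelu(x, alpha))                         # [-0.5  -0.125  1.  3. ]
print(prelu_grad_alpha(x, alpha, np.ones(4)))  # -2.5
```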

Maxout Activation

The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is represented as –

f(x) = max(w1ᵀx + b1, w2ᵀx + b2)
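A minimal NumPy sketch of a maxout unit with two linear pieces (the weights, biases and input are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear "pieces"; the maxout unit takes their element-wise maximum.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 3)), rng.normal(size=4)

def maxout(x):
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

x = rng.normal(size=3)
print(maxout(x))
# With W1 = 0 and b1 = 0 this reduces to a ReLU of the second linear unit;
# with W1 = alpha * W2 and b1 = alpha * b2 it behaves like a leaky ReLU of it.
```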

Best Practices (For Hidden Layer Activations)

- Use ReLU as the default choice, but keep an eye on the learning rate and on the fraction of “dead” units in the network.
- If dying ReLUs are a problem, try leaky ReLU, PReLU or Maxout.
- Avoid sigmoid in the hidden layers; tanh can be tried, but it usually works worse than ReLU and its variants.

Recent Developments

The field of neural networks is evolving rapidly. You should not be surprised if something you learn today gets replaced by a totally new technique in a few months.

Therefore, I have also included some of the recent developments in activation functions that claim to outperform the current favorite, ReLU. I have not used them before, but if I get unsatisfactory results with the existing techniques, I will definitely try them out.

Swish Activation

The swish activation function is represented as,

f(x) = x * σ(β * x), where σ(x) = 1 ∕ (1 + e^(-x)) is the sigmoid function and β is either a constant or a trainable parameter.

According to the paper Searching for Activation Functions [2], published by the Google Brain team, the swish function outperforms ReLU.
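A minimal NumPy sketch of the swish function with β fixed at 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)   # f(x) = x * sigmoid(beta * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(swish(x))   # ~[-0.142 -0.269  0.  0.731  2.858]
```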

E-Swish Activation

The E-Swish function was introduced in a fairly recent paper from January 2018 – E-swish: Adjusting Activations to Different Network Depths [3].

The mathematical function proposed is,

f(x) = β * x * sigmoid(x), where β is a constant

The recommended value of β is 1 ≤ β ≤ 2. The paper reports that this function outperforms ReLU and also the swish activation function.
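A minimal NumPy sketch of E-swish, using β = 1.5 as an example value inside the recommended range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def e_swish(x, beta=1.5):
    # beta is a fixed constant, recommended to lie between 1 and 2
    return beta * x * sigmoid(x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(e_swish(x))   # ~[-0.213 -0.403  0.  1.097  4.287]
```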

More Interesting Papers

Again, these are very new findings and some may not live up to their promises, but it is important for us to keep track of the latest developments.

References

[2] Prajit Ramachandran, Barret Zoph, Quoc V. Le – Searching for Activation Functions, 2017.
[3] Eric Alcaide – E-swish: Adjusting Activations to Different Network Depths, 2018.

Thank You. 🙂