The range of a synthetic neuron’s activation values is set by its activation function, which is applied to the weighted sum of the neuron’s inputs. What distinguishes an activation function such as tanh is its non-linearity. Without an activation function, a multilayer perceptron would compute its outputs simply by multiplying weights by input values, and that would be all there is to it.
A sequence of linear operations is equivalent to a single linear operation. It is the non-linear activation function that makes an artificial neural network, and the function it approximates, non-linear. According to the universal approximation theorem, a multilayer perceptron with a single hidden layer and a non-linear activation function is a universal function approximator.
What’s the point of Activation Functions?
In neural networks, an activation function introduces non-linearity into a neuron’s output. Without activation functions, a neural network can only compute linear mappings from inputs x to outputs y. Just why is this the case?
In the absence of activation functions, all that would happen during forward propagation is the multiplication of weight matrices by input vectors.
To compute truly interesting things, neural networks must be able to model non-linear relationships between input vectors x and outputs y. The mapping from x to y is said to be “non-linear” when the underlying data is complex.
Without an activation function in the hidden layer, our neural network would not be able to mathematically realize such complicated interactions, and thus would not be able to complete the tasks we have set for it.
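The collapse of stacked linear layers can be illustrated with a short NumPy sketch (the layer sizes and random weights here are arbitrary, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function: weight matrices only
W1 = rng.standard_normal((4, 3))   # first layer: 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # second layer: 4 units -> 2 outputs
x = rng.standard_normal(3)         # an input vector

# Forward pass through both linear layers
two_layer_output = W2 @ (W1 @ x)

# The same mapping as a single linear layer with weights W2 @ W1
single_layer_output = (W2 @ W1) @ x

print(np.allclose(two_layer_output, single_layer_output))  # True
```

No matter how many such layers we stack, the network can never represent more than one matrix multiplication.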
The Big Four of Deep Learning’s Activation Functions
Now would be a good time to talk about the most common activation functions in Deep Learning, as well as the pros and cons of each.
The Sigmoid Function
Sigmoid was the most popular activation function until recently. Sigmoid transforms inputs into a 0-1 interval:
The function accepts x as an argument and produces a result in the open interval (0, 1). Nowadays, the sigmoid nonlinearity is rarely employed in practice. It suffers primarily from two flaws:
Using a Sigmoid effectively “kills” gradients.
The first flaw is that sigmoid functions can cause gradients to vanish. The function saturates: neuron activations get pinned near 0 or 1 (the blue regions). In these blue regions, the sigmoid function’s derivative approaches zero for large negative or positive input values. A near-zero derivative yields a very small gradient of the loss function, which prevents weight updates and stalls learning.
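A quick NumPy sketch makes the saturation visible (the function names here are my own, not from any framework):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative: sigmoid(z) * (1 - sigmoid(z)), largest at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_prime(0.0))    # 0.25, the maximum possible value
print(sigmoid_prime(10.0))   # ~4.5e-05: the gradient has effectively vanished
```

For inputs of even moderate magnitude, the derivative is tiny, so almost no gradient flows back through a saturated sigmoid neuron.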
The Tanh Activation Function
The tanh activation function is another widely used activation function in Deep Learning. A diagram of the hyperbolic tangent function is as follows:
Like the sigmoid function, the derivative of the neuron’s response approaches zero as the input gets very large in either direction (blue region in Fig. 3). Unlike the sigmoid function, its outputs are zero-centered. In practice, tanh is preferred over sigmoid.
Using the following code, we can apply the tanh activation function in TensorFlow:

import tensorflow as tf

z = tf.constant([-1.5, -0.2, 0, 0.5], dtype=tf.float32)
output = tf.keras.activations.tanh(z)
print(output.numpy())  # [-0.90514827 -0.19737533  0.          0.46211714]
How can I create the tanh activation function and its derivative in Python?
Writing a tanh activation function and its derivative is quite simple: we only need to define functions that follow the formulas. The implementation looks like this:

import numpy as np

# definition of the tanh activation function
def tanh_function(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# definition of its derivative
def tanh_prime(z):
    return 1 - np.power(tanh_function(z), 2)
Use the tanh activation function when:
Since the tanh activation function’s values range from -1 to 1, it can be used to center data: it brings the mean of the activations close to 0, which makes learning easier for the next layer. This is what makes the tanh activation function practical.
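The zero-centering property is easy to check numerically (the input values below are arbitrary, chosen only for illustration):

```python
import numpy as np

# A symmetric sample of pre-activation values
z = np.linspace(-3, 3, 101)

# tanh outputs are centered around 0; sigmoid outputs around 0.5
print(np.tanh(z).mean())              # ~0.0
print((1 / (1 + np.exp(-z))).mean())  # ~0.5
```

The next layer thus receives roughly zero-mean inputs from tanh, whereas sigmoid always hands it strictly positive values.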
Python code for the tanh activation function, written in its simplest form
# import libraries
import matplotlib.pyplot as plt
import numpy as np

# creating a tanh activation function and its derivative
def tanh(z):
    a = np.tanh(z)
    da = 1 - a ** 2
    return a, da

b = np.linspace(-10, 10, 100)
a, da = tanh(b)

# Setup centered axes
fig, ax = plt.subplots(figsize=(9, 5))

# Create and show the plot
ax.plot(b, a, color="#307EC7", linewidth=3, label="tanh")
ax.plot(b, da, color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
plt.show()
Following is the output of the aforementioned code, which graphs the tanh and its derivative function.
The Softmax Activation Function
The softmax activation function is the last one I’d like to discuss. It stands out from the other activation functions.
To represent probabilities in the interval [0, 1], the softmax activation function constrains the values of the output neurons to be between zero and one.
This means that each feature vector x belongs to exactly one class. A feature vector representing an image of a dog cannot belong to the class dog and the class cat with equal probability; it must be assigned unambiguously to the dog category.
For mutually exclusive classes, the probability values of all output neurons must also sum to one; only then does the neural network represent a valid probability distribution. A counterexample would be a network that assigns an image of a dog an 80% probability for the class dog and, at the same time, a 60% probability for the class cat.
Lucky for us, the softmax function not only restricts the outputs to a value between 0 and 1 but also guarantees that the aggregate of all outputs for all classes always equals 1. Now, let’s look at the inner workings of the softmax function.
Let’s pretend that the neurons in the output layer are fed a vector z that is the result of multiplying the output of the previous layer by the weight matrix of the current layer. With softmax activation, a neuron in the output layer takes in a single value z_1, which is an entry in the vector z, and returns a single value y_1.
With softmax activation, we get the following formula for calculating the output of each neuron in the output layer:

y_j = exp(z_j) / Σ_k exp(z_k)
You can see that each neuron’s output y_j depends not only on its own input value z_j but on the full vector of z values. Every output y_j lies between 0 and 1 and can therefore be read as a probability, and the probability predictions of all output neurons sum to 1.
Output neurons indicate the probabilistic distribution of mutually exclusive class labels.
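A minimal NumPy implementation of softmax (using the common max-subtraction trick for numerical stability) shows both properties at once, the (0, 1) range and the unit sum; the logit values below are made-up example numbers:

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw output-layer values (example numbers)
probs = softmax(logits)

print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```

The largest logit receives the largest probability, and the three outputs always add up to exactly one.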
Which Activation Function Should We Use?
The best answer I can provide to this question is “it depends,” so that’s what I’ll say.
It depends on the problem’s type and the final product’s intended range.
So, if we want our neural network to predict numbers bigger than 1, we can’t use a tanh or sigmoid in the output layer; instead, we’ll need to utilize ReLU.
In contrast, sigmoid or tanh should be used for the output layer instead of ReLU if the expected range of the output values is [0,1] or [-1, 1].
In classification jobs requiring a probability distribution over mutually exclusive class labels, use softmax in the final neural network layer.
Nonetheless, as a general rule, I advise always using ReLU as the activation for hidden layers.
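For completeness, ReLU itself is a one-liner (a minimal NumPy sketch, not tied to any particular framework):

```python
import numpy as np

def relu(z):
    # Passes positive values through unchanged, clamps negatives to 0
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```

Because its output is unbounded above, ReLU suits output layers that must predict values larger than 1, and its non-saturating positive side is what makes it the default choice for hidden layers.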