ReLU Activation Function in Neural Networks

1. Introduction

The Rectified Linear Unit (ReLU) is one of the most widely used activation functions in modern neural networks.
Its primary purpose is to introduce non-linearity into the network while maintaining computational efficiency.

Mathematically, ReLU is defined as:

f(x) = \max(0, x)

2. How ReLU Works

  • If x > 0, then f(x) = x (the input passes through unchanged).
  • If x ≤ 0, then f(x) = 0 (the neuron is inactive).
Input (x)    Output f(x)
-3           0
-0.7         0
0            0
1.2          1.2
5            5

Equivalently, ReLU can be written piecewise:

f(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ x & \text{if } x > 0 \end{cases}

3. Role of ReLU in Neural Networks

Without activation functions, a neural network collapses to a single linear transformation of its input, no matter how many layers it has.
ReLU introduces the non-linearity that lets the network learn complex, nonlinear mappings.
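
To see this concretely, here is a minimal NumPy sketch (the matrix shapes are arbitrary): two stacked linear layers collapse into a single linear map, while placing ReLU between them breaks the equivalence.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Two stacked linear layers are equivalent to one linear map W2 @ W1.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# Inserting ReLU between the layers makes the composition nonlinear.
nonlinear = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(nonlinear, one_layer))   # False in general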

Advantages:

  1. Non-linearity: Enables the network to capture complex patterns.
  2. Efficient computation: Just a threshold at zero.
  3. Better gradient flow: the gradient is 1 for all positive inputs, so ReLU avoids the saturation-driven vanishing gradients of sigmoid and tanh.
  4. Sparse activation: many outputs are exactly zero, which reduces computation and acts as a mild regularizer (see the sketch below).
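
As a rough sketch of points 3 and 4: the (sub)gradient of ReLU is 1 for positive inputs and 0 otherwise, and on zero-mean inputs roughly half of the activations come out exactly zero.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Subgradient of ReLU: 1 where the input is positive, 0 elsewhere.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]

# Sparsity: for zero-mean random inputs, about half of the outputs are zero.
z = np.random.default_rng(0).normal(size=10_000)
print(np.mean(relu(z) == 0))  # roughly 0.5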

4. Variants of ReLU

4.1 Leaky ReLU

Instead of outputting zero for negative values, Leaky ReLU allows a small slope:

f(x) = \begin{cases} \alpha x & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}

where α is a small constant (e.g., α = 0.01).
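
A minimal NumPy sketch of Leaky ReLU, assuming the common default α = 0.01:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive values pass through; negative values are scaled by alpha.
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 1.0, 2.0])
print(leaky_relu(x))  # -3 -> -0.03, -0.5 -> -0.005; positives unchanged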


4.2 Parametric ReLU (PReLU)

Similar to Leaky ReLU, but ( \alpha ) is learned during training:

f(x) = \begin{cases} a x & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}

where a is a trainable parameter.
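
In PyTorch this is available directly as nn.PReLU, whose negative slope is a learnable parameter updated by the optimizer; a minimal sketch:

import torch
import torch.nn as nn

# The negative slope "a" is a trainable parameter (default initial value 0.25).
prelu = nn.PReLU(num_parameters=1, init=0.25)

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(prelu(x))                  # negative inputs scaled by the current slope (0.25 here)
print(list(prelu.parameters()))  # the slope appears in parameters(), so it gets trained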


4.3 Exponential Linear Unit (ELU)

ELU smooths the curve for negative inputs:

f(x) = \begin{cases} \alpha (e^x - 1) & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}
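
A minimal NumPy sketch of ELU, assuming the common default α = 1.0:

import numpy as np

def elu(x, alpha=1.0):
    # Smoothly saturates toward -alpha for large negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-3.0, -0.5, 0.0, 1.0, 2.0])
print(elu(x))  # approx [-0.95, -0.39, 0., 1., 2.]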

4.4 GELU (Gaussian Error Linear Unit)

A smoother alternative to ReLU, often used in Transformers:

\text{GELU}(x) = x \cdot \Phi(x)

where Φ(x) is the standard Gaussian CDF.
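
A minimal NumPy sketch of the exact GELU, writing Φ in terms of the error function (frameworks often use a faster tanh-based approximation instead):

import numpy as np
from math import erf, sqrt

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF expressed via erf.
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return x * phi(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # approx [-0.05, -0.15, 0., 0.35, 1.95]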


5. Python Code Examples

5.1 Basic ReLU Implementation

import numpy as np

def relu(x):
    return np.maximum(0, x)

# Example
x = np.array([-3, -0.5, 0, 1, 2])
print(relu(x))  # Output: [0.  0.  0.  1.  2.]

5.2 Using ReLU in PyTorch

import torch
import torch.nn as nn

# Example: single layer with ReLU
model = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU()
)

x = torch.tensor([[-1.0, 0.5, 2.0, -0.3, 4.0]])
output = model(x)
print(output)

5.3 Using Leaky ReLU in Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

model = Sequential()
model.add(Dense(64, input_dim=10))
model.add(LeakyReLU(alpha=0.01))
model.add(Dense(1, activation='sigmoid'))

6. Limitations of ReLU

  • Dying ReLU problem: a neuron whose pre-activation is negative for every input outputs 0, receives zero gradient, and can stop learning entirely.
  • No activation for negative inputs can limit representational power.
  • Possible mitigations (a sketch follows this list):
  1. Use Leaky ReLU, PReLU, or ELU.
  2. Use careful weight initialization (e.g., He/Kaiming initialization).
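
As a sketch of the second mitigation, He (Kaiming) initialization, shown here in PyTorch, scales the initial weights for ReLU-shaped nonlinearities so that pre-activations are less likely to end up always negative.

import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# He / Kaiming initialization is designed for ReLU-like activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)

model = nn.Sequential(layer, nn.ReLU())
x = torch.randn(32, 256)
print((model(x) > 0).float().mean())  # fraction of active units, roughly 0.5 here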

7. Summary of ReLU Variants

Activation  | Formula                           | Pros                                           | Cons
ReLU        | max(0, x)                         | Fast, simple, avoids vanishing gradients       | Dying ReLU
Leaky ReLU  | x if x ≥ 0, else αx               | Fixes dying neurons                            | Slightly more compute
PReLU       | x if x ≥ 0, else ax (a learnable) | Flexible                                       | Risk of overfitting
ELU         | x if x ≥ 0, else α(e^x − 1)       | Smooth negative side, better mean activations  | Slightly slower
GELU        | x · Φ(x)                          | Smooth, Gaussian-based; used in Transformers   | More complex to compute
