TL;DR
Neural networks are computational models inspired by the human brain's neuron connections, forming the foundation of modern deep learning and artificial intelligence. This guide covers biological vs artificial neurons, network architecture (input, hidden, output layers), activation functions (ReLU, Sigmoid, Tanh), forward and backpropagation algorithms, loss functions and optimizers, and mainstream architectures like CNN, RNN, and Transformer, with practical PyTorch and TensorFlow code examples.
Introduction
Neural networks are among the most important technologies in artificial intelligence. From image recognition to natural language processing, from autonomous driving to medical diagnosis, neural networks are transforming our world. Understanding how neural networks work not only helps you better utilize AI tools but also builds a solid foundation for deeper exploration of deep learning.
In this guide, you will learn:
- Comparison between biological and artificial neurons
- Basic structure and mathematical principles of neural networks
- Characteristics and selection criteria for common activation functions
- How forward propagation and backpropagation work
- The role of loss functions and optimizers
- Mainstream architectures: CNN, RNN, Transformer
- Practical code with PyTorch and TensorFlow
From Biological to Artificial Neurons
How Biological Neurons Work
The human brain contains approximately 86 billion neurons, each connected to others through synapses, forming complex neural networks. The basic workflow of biological neurons:
- Dendrites receive signals from other neurons
- Cell body integrates and processes signals
- When signal strength exceeds a threshold, the neuron activates
- Axon transmits signals to the next neuron
Mathematical Model of Artificial Neurons
Artificial neurons (also called perceptrons) simulate how biological neurons work:
Mathematical expression:
y = f(w₁x₁ + w₂x₂ + w₃x₃ + b) = f(Σwᵢxᵢ + b)
Where:
- xᵢ: Input features
- wᵢ: Weights, representing the importance of each input
- b: Bias, adjusting the activation threshold
- f: Activation function, introducing non-linearity
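To make the formula concrete, here is a minimal NumPy sketch of a single neuron with a simple threshold activation (the inputs, weights, and bias below are made-up values, purely for illustration):

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a threshold activation."""
    z = np.dot(w, x) + b           # z = sum(w_i * x_i) + b
    return 1.0 if z > 0 else 0.0   # f: fires (1) only above the threshold

x = np.array([0.5, -1.0, 2.0])    # input features (made up)
w = np.array([0.8, 0.2, 0.1])     # weights
b = -0.3                          # bias

print(neuron(x, w, b))            # z = 0.1 > 0, so the neuron fires: 1.0
```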
Biological vs Artificial Neurons
| Property | Biological Neuron | Artificial Neuron |
|---|---|---|
| Signal Type | Electrochemical | Numerical |
| Connection | Synapse | Weight |
| Activation | Threshold Trigger | Activation Function |
| Learning | Synaptic Plasticity | Gradient Descent |
| Processing Speed | Milliseconds | Nanoseconds |
| Power Consumption | ~20 Watts (entire brain) | Hundreds to thousands of watts (GPU training) |
Basic Neural Network Structure
Neural networks consist of multiple layers of neurons, with each layer fully connected to adjacent layers.
Three-Layer Basic Architecture
Input Layer → Hidden Layer → Output Layer
Role of Each Layer
Input Layer
- Receives raw data
- Number of neurons equals feature dimensions
- No computation, only data passing
Hidden Layer
- Extracts and transforms features
- Can have multiple layers (deep learning)
- Number of neurons is a hyperparameter
Output Layer
- Produces final predictions
- Classification: neurons equal number of classes
- Regression: typically one neuron
Deep Neural Networks
When a network stacks multiple hidden layers, it is called a Deep Neural Network (DNN):
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Hidden Layer N → Output Layer
Deep networks can learn more complex feature hierarchies:
- Shallow layers: Learn simple features (edges, textures)
- Middle layers: Learn combined features (shapes, parts)
- Deep layers: Learn abstract features (objects, concepts)
Activation Functions Explained
Activation functions introduce non-linearity to neural networks, enabling them to learn complex patterns.
ReLU (Rectified Linear Unit)
ReLU is currently the most commonly used activation function:
```python
import numpy as np

def relu(x):
    # np.maximum works elementwise, so this handles scalars and arrays alike
    return np.maximum(0, x)
```

f(x) = max(0, x)
Advantages:
- Simple and efficient computation
- Mitigates vanishing gradient problem
- Sparse activation improves efficiency
Disadvantages:
- Dead ReLU problem (neurons permanently inactive)
- Non-zero centered output
Sigmoid
Sigmoid compresses output to the (0, 1) range:
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

f(x) = 1 / (1 + e^(-x))
Advantages:
- Bounded output range
- Suitable for binary classification output
Disadvantages:
- Vanishing gradient problem
- Non-zero centered output
- Relatively complex computation
Tanh (Hyperbolic Tangent)
Tanh compresses output to the (-1, 1) range:
```python
def tanh(x):
    return np.tanh(x)
```

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Advantages:
- Zero-centered output
- Stronger gradients than Sigmoid
Disadvantages:
- Still suffers from vanishing gradient
Activation Function Selection Guide
| Scenario | Recommended Activation |
|---|---|
| Hidden layers (default) | ReLU or variants |
| Binary classification output | Sigmoid |
| Multi-class output | Softmax |
| RNN hidden layers | Tanh |
| Prevent Dead ReLU | Leaky ReLU, ELU |
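Softmax, recommended above for multi-class outputs, turns a vector of raw scores into probabilities that sum to 1. A numerically stable sketch (subtracting the maximum before exponentiating avoids overflow without changing the result):

```python
import numpy as np

def softmax(z):
    """Map raw scores to probabilities; subtracting the max avoids overflow."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)   # three probabilities summing to 1; largest for the score 2.0
```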
Forward and Backpropagation
Forward Propagation
Forward propagation is the process of data flowing from input to output layer:
```python
def forward_propagation(X, weights, biases):
    """
    Simple forward propagation: ReLU in the hidden layers, softmax at the
    output. Assumes NumPy-aware relu and softmax helpers are defined.
    """
    activations = [X]
    for i in range(len(weights)):
        # Weighted sum of the previous layer's activations
        z = np.dot(activations[-1], weights[i]) + biases[i]
        if i < len(weights) - 1:
            a = relu(z)       # hidden layers
        else:
            a = softmax(z)    # output layer: class probabilities
        activations.append(a)
    return activations
```
Forward propagation flow:
- Input data enters input layer
- Compute weighted sum: z = Wx + b
- Apply activation function: a = f(z)
- Pass to next layer
- Repeat until output layer
Backpropagation
Backpropagation applies the chain rule to compute the gradient of the loss with respect to every weight; these gradients then drive the weight updates:
```python
def backpropagation(y_true, activations, weights):
    """
    Gradient computation via the chain rule. Assumes a softmax output with
    cross-entropy loss, so the output-layer error is simply (predictions -
    labels); relu_derivative(a) is an elementwise (a > 0) indicator.
    """
    m = y_true.shape[0]
    gradients_w = []
    gradients_b = []
    delta = activations[-1] - y_true   # output-layer error
    for i in range(len(weights) - 1, -1, -1):
        dW = np.dot(activations[i].T, delta) / m
        db = np.sum(delta, axis=0) / m
        gradients_w.insert(0, dW)
        gradients_b.insert(0, db)
        if i > 0:
            # Propagate the error backwards through the ReLU
            delta = np.dot(delta, weights[i].T) * relu_derivative(activations[i])
    return gradients_w, gradients_b
```
Gradient Descent Optimization
Update weights using computed gradients:
```python
def gradient_descent(weights, biases, grad_w, grad_b, learning_rate):
    """
    Gradient descent parameter update: step each parameter against its gradient.
    """
    for i in range(len(weights)):
        weights[i] -= learning_rate * grad_w[i]
        biases[i] -= learning_rate * grad_b[i]
    return weights, biases
```
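Putting the three steps together, training alternates forward pass, backward pass, and update until the loss stops improving. The sketch below is self-contained (it re-implements tiny helpers rather than reusing the snippets above) and fits a two-layer network to a made-up binary task; the architecture and hyperparameters are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: label 1 when the two inputs share a sign
X = rng.normal(size=(200, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float).reshape(-1, 1)

# Tiny two-layer network: 2 -> 8 -> 1
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

lr, losses = 0.5, []
for epoch in range(2000):
    # 1. Forward pass
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)       # ReLU hidden layer
    a2 = sigmoid(a1 @ W2 + b2)   # sigmoid output for binary labels
    losses.append(bce(y, a2))

    # 2. Backward pass (sigmoid + cross-entropy: output error is a2 - y)
    m = X.shape[0]
    d2 = (a2 - y) / m
    dW2, db2 = a1.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * (z1 > 0)  # chain rule through the ReLU
    dW1, db1 = X.T @ d1, d1.sum(axis=0)

    # 3. Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loss should fall steadily across epochs; if it does not, the usual first suspects are the learning rate and the weight initialization scale.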
Loss Functions and Optimizers
Common Loss Functions
Mean Squared Error (MSE) - Regression tasks
```python
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
```
Cross-Entropy Loss - Classification tasks
```python
def cross_entropy_loss(y_true, y_pred):
    epsilon = 1e-15
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred))
```
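A quick comparison shows why cross-entropy suits classification: a confidently wrong prediction is penalized far more heavily than a mildly confident correct one. The probability vectors below are made up, and `cross_entropy_loss` is repeated so the example runs standalone:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])               # one-hot label: class 1

mild = np.array([0.2, 0.6, 0.2])                 # mildly confident, correct
confident_wrong = np.array([0.9, 0.05, 0.05])    # very confident, wrong

print(cross_entropy_loss(y_true, mild))              # small loss
print(cross_entropy_loss(y_true, confident_wrong))   # much larger loss
```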
Popular Optimizers
SGD (Stochastic Gradient Descent)

```
w = w - learning_rate * gradient
```

Momentum

```
v = momentum * v - learning_rate * gradient
w = w + v
```

Adam (Adaptive Moment Estimation), shown here without its bias-correction terms:

```
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient ** 2
w = w - learning_rate * m / (sqrt(v) + epsilon)
```
| Optimizer | Characteristics | Use Cases |
|---|---|---|
| SGD | Simple and stable | Large-scale data |
| Momentum | Accelerates convergence | Local optima present |
| Adam | Adaptive learning rate | Default choice |
| AdamW | Weight decay | Large model training |
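The Adam pseudocode above is simplified; the published algorithm also bias-corrects the moment estimates, which matters in the early steps while `m` and `v` are still near zero. A plain-NumPy sketch with the correction included (function and variable names are illustrative):

```python
import numpy as np

def adam_step(w, grad, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` carries the moment estimates and step count."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected 1st moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])   # bias-corrected 2nd moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 5.0
w = np.array(5.0)
state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
for _ in range(5000):
    w = adam_step(w, 2 * w, state)
print(float(w))   # close to the minimum at 0
```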
Common Network Architectures
Convolutional Neural Networks (CNN)
CNNs are designed for processing grid-like data such as images:
PyTorch Implementation:
```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),  # assumes 32x32 inputs (e.g. CIFAR-10)
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
```
Applications: Image classification, object detection, face recognition
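To see what a convolutional layer actually computes, here is a naive single-channel, single-kernel sketch in plain NumPy (no padding, stride, or batching, for clarity; note that deep learning "convolution" is technically cross-correlation):

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in deep learning)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1    # output shrinks by kernel size - 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product between the kernel and one image patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # simple vertical-edge detector
print(conv2d_naive(image, kernel).shape)     # (3, 3)
```

`nn.Conv2d` does the same sliding dot product, but over many input channels and many learned kernels at once, with optimized kernels under the hood.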
Recurrent Neural Networks (RNN)
RNNs are designed for processing sequential data:
PyTorch Implementation:
```python
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initial hidden state: (num_layers, batch, hidden_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.rnn(x, h0)
        # Classify from the last time step's hidden state
        out = self.fc(out[:, -1, :])
        return out
```
LSTM Variant: Mitigates the long-term dependency problem through gating:

```python
self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
```
Applications: Text generation, speech recognition, time series prediction
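The recurrence inside `nn.RNN` is compact enough to write out in NumPy: each time step mixes the current input with the previous hidden state through a tanh, reusing the same weights at every step. The weights below are random, purely to illustrate the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 5

# Cell parameters, randomly initialized purely for illustration
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(xs):
    """h_t = tanh(x_t W_xh + h_{t-1} W_hh + b): same weights at every step."""
    h = np.zeros(hidden_size)
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h   # final hidden state summarizes the whole sequence

xs = rng.normal(size=(seq_len, input_size))
print(rnn_forward(xs).shape)   # (8,)
```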
Transformer
The Transformer is built on the self-attention mechanism and forms the foundation of modern large language models:
PyTorch Implementation:
```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention: query, key, and value are all the input sequence
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))   # residual + layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))     # residual + layer norm
        return x
```
Applications: GPT, BERT, machine translation, text generation
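The self-attention at the heart of `nn.MultiheadAttention` reduces, for a single head without projections or masking, to scaled dot-product attention. A NumPy sketch with made-up token vectors:

```python
import numpy as np

def softmax(z):
    shifted = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq): token-to-token similarity
    weights = softmax(scores)         # each row is a probability distribution
    return weights @ V                # weighted mixture of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))          # 6 tokens with d_model = 16 (made up)
out = attention(x, x, x)              # self-attention: Q = K = V = x
print(out.shape)                      # (6, 16)
```

Real implementations add learned Q/K/V projections, multiple heads run in parallel, and masking, but the core computation is exactly this.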
TensorFlow Practice Examples
Building a Simple Neural Network
```python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# x_train, y_train: e.g. flattened 28x28 MNIST images and integer labels
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)
```
Building a CNN
```python
cnn_model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
```
Preventing Overfitting
Overfitting is a common problem in neural network training where models perform well on training data but poorly on test data.
Common Regularization Techniques
Dropout: Randomly drop neurons
```python
nn.Dropout(p=0.5)
```
L2 Regularization: Limit weight magnitude
```python
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)
```
Early Stopping: Monitor validation loss
```python
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)
```
Data Augmentation: Expand training data
```python
from torchvision import transforms

transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2)
])
```
Tool Recommendations
The following tools can improve efficiency during neural network development and AI learning:
- JSON Formatter - Format model configuration files and training logs
- Base64 Encoder/Decoder - Handle model weights and embedding vector encoding
- Text Diff Tool - Compare different model configurations
- Random Data Generator - Generate test datasets
- Number Base Converter - Understand binary weight representations
Summary
Key points about neural networks:
- Artificial Neurons: Simulate biological neurons, processing information through weighted sums and activation functions
- Network Structure: Input layer receives data, hidden layers extract features, output layer produces predictions
- Activation Functions: ReLU is the default choice, Sigmoid for binary classification output
- Forward Propagation: The computational process of data flowing from input to output
- Backpropagation: Compute gradients based on chain rule to update network weights
- Common Architectures: CNN for images, RNN for sequences, Transformer as the foundation of modern large models
Mastering neural network fundamentals is the first step into the deep learning field, building a solid foundation for learning more complex model architectures and applications.
FAQ
What is the difference between neural networks and deep learning?
Deep learning is the branch of machine learning built on neural networks with many hidden layers. Traditional neural networks may have only one or two hidden layers, while modern deep networks often have dozens or even hundreds. The extra depth lets them learn richer feature hierarchies, which has driven breakthrough progress in image, speech, and natural language processing.
Why do neural networks need activation functions?
Without activation functions, regardless of how many layers a network has, the entire network would be equivalent to a single linear transformation, unable to learn complex non-linear patterns. Activation functions introduce non-linearity, enabling neural networks to approximate arbitrarily complex functions, which is key to their powerful expressive capability.
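This collapse is easy to verify numerically: two stacked linear layers with no activation in between equal one linear layer with composed weights. A NumPy check with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
x = rng.normal(size=(5, 4))

# Two stacked linear layers with no activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...collapse to a single linear layer with composed weights
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))   # True
```

Inserting a ReLU between the two layers breaks this equivalence, which is exactly the point of having one.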
How do I choose the number of layers and neurons in a neural network?
This is a hyperparameter tuning problem that requires experimentation. General principles: 1) Start with a simple network and gradually increase complexity; 2) Hidden layer neuron count is typically between input and output dimensions; 3) Use validation sets to evaluate different configurations; 4) Be aware of overfitting risks—complex networks need more data and regularization.
What are vanishing and exploding gradients?
Vanishing gradients occur when gradients become progressively smaller during backpropagation, causing shallow layer weights to barely update. Exploding gradients are the opposite, where gradients grow layer by layer causing numerical overflow. Solutions include: using ReLU activation, batch normalization, residual connections, proper weight initialization, and gradient clipping.
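The vanishing effect is easy to quantify for sigmoid: its derivative never exceeds 0.25, so even in the best case a chain of 20 sigmoid layers scales gradients by at most 0.25 per layer:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)   # maximum value 0.25, reached at z = 0

# Best case: every pre-activation sits at z = 0, where the derivative peaks
depth = 20
grad_scale = sigmoid_grad(0.0) ** depth
print(grad_scale)   # 0.25 ** 20 ≈ 9.1e-13: the gradient has effectively vanished
```

ReLU's derivative is exactly 1 for positive inputs, which is why switching to ReLU largely avoids this particular failure mode.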
What tasks are CNN and RNN suitable for respectively?
CNNs excel at processing data with spatial structure, such as images and videos, because convolution operations effectively extract local features while maintaining translation invariance. RNNs excel at processing sequential data, such as text, speech, and time series, because recurrent structures can remember historical information. Today, Transformers have surpassed traditional CNNs and RNNs in many tasks.
How do I determine if a model is overfitting?
Typical signs of overfitting include training loss continuing to decrease while validation loss starts increasing, or training accuracy being much higher than validation accuracy. You can diagnose by plotting learning curves: if there's a large gap between training and validation curves, overfitting exists. Solutions include adding more data, using regularization, and reducing model complexity.