TL;DR

Neural networks are computational models inspired by the human brain's neuron connections, forming the foundation of modern deep learning and artificial intelligence. This guide covers biological vs artificial neurons, network architecture (input, hidden, output layers), activation functions (ReLU, Sigmoid, Tanh), forward and backpropagation algorithms, loss functions and optimizers, and mainstream architectures like CNN, RNN, and Transformer, with practical PyTorch and TensorFlow code examples.

Introduction

Neural networks are among the most important technologies in artificial intelligence. From image recognition to natural language processing, from autonomous driving to medical diagnosis, neural networks are transforming our world. Understanding how neural networks work not only helps you better utilize AI tools but also builds a solid foundation for deeper exploration of deep learning.

In this guide, you will learn:

  • Comparison between biological and artificial neurons
  • Basic structure and mathematical principles of neural networks
  • Characteristics and selection criteria for common activation functions
  • How forward propagation and backpropagation work
  • The role of loss functions and optimizers
  • Mainstream architectures: CNN, RNN, Transformer
  • Practical code with PyTorch and TensorFlow

From Biological to Artificial Neurons

How Biological Neurons Work

The human brain contains approximately 86 billion neurons, each connected to others through synapses, forming complex neural networks. The basic workflow of biological neurons:

  1. Dendrites receive signals from other neurons
  2. Cell body integrates and processes signals
  3. When signal strength exceeds a threshold, the neuron activates
  4. Axon transmits signals to the next neuron

mermaid
graph LR
    subgraph "Biological Neuron"
        D1[Dendrite 1] --> CB[Cell Body]
        D2[Dendrite 2] --> CB
        D3[Dendrite 3] --> CB
        CB --> |Threshold Exceeded| A[Axon]
        A --> S[Synapse]
    end

Mathematical Model of Artificial Neurons

Artificial neurons (also called perceptrons) simulate how biological neurons work:

mermaid
graph LR
    subgraph "Artificial Neuron"
        X1[x₁] --> |w₁| SUM((Σ))
        X2[x₂] --> |w₂| SUM
        X3[x₃] --> |w₃| SUM
        B[Bias b] --> SUM
        SUM --> ACT[Activation f]
        ACT --> Y[Output y]
    end

Mathematical expression:

code
y = f(w₁x₁ + w₂x₂ + w₃x₃ + b) = f(Σwᵢxᵢ + b)

Where:

  • xᵢ: Input features
  • wᵢ: Weights, representing the importance of each input
  • b: Bias, adjusting the activation threshold
  • f: Activation function, introducing non-linearity
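As a sketch, this weighted-sum-plus-activation computation takes only a few lines of NumPy (the inputs, weights, and bias below are arbitrary illustrative values, and the threshold activation is just one possible choice of f):

```python
import numpy as np

def neuron(x, w, b, f):
    """Single artificial neuron: y = f(w·x + b)."""
    return f(np.dot(w, x) + b)

# Illustrative values: three inputs, arbitrary weights and bias
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 0.05
step = lambda z: 1.0 if z > 0 else 0.0  # simple threshold activation

y = neuron(x, w, b, step)  # w·x + b = 0.5 - 0.4 + 0.3 + 0.05 = 0.45 > 0, so y = 1.0
```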

Biological vs Artificial Neurons

Property              Biological Neuron      Artificial Neuron
--------------------------------------------------------------------------
Signal Type           Electrochemical        Numerical
Connection            Synapse                Weight
Activation            Threshold Trigger      Activation Function
Learning              Synaptic Plasticity    Gradient Descent
Processing Speed      Milliseconds           Nanoseconds
Power Consumption     ~20 Watts              Hundreds to Thousands of Watts

Basic Neural Network Structure

Neural networks consist of multiple layers of neurons, with each layer fully connected to adjacent layers.

Three-Layer Basic Architecture

mermaid
graph LR
    subgraph "Input Layer"
        I1((x₁))
        I2((x₂))
        I3((x₃))
    end
    subgraph "Hidden Layer"
        H1((h₁))
        H2((h₂))
        H3((h₃))
        H4((h₄))
    end
    subgraph "Output Layer"
        O1((y₁))
        O2((y₂))
    end
    I1 --> H1
    I1 --> H2
    I1 --> H3
    I1 --> H4
    I2 --> H1
    I2 --> H2
    I2 --> H3
    I2 --> H4
    I3 --> H1
    I3 --> H2
    I3 --> H3
    I3 --> H4
    H1 --> O1
    H1 --> O2
    H2 --> O1
    H2 --> O2
    H3 --> O1
    H3 --> O2
    H4 --> O1
    H4 --> O2

Role of Each Layer

Input Layer

  • Receives raw data
  • Number of neurons equals feature dimensions
  • No computation, only data passing

Hidden Layer

  • Extracts and transforms features
  • Can have multiple layers (deep learning)
  • Number of neurons is a hyperparameter

Output Layer

  • Produces final predictions
  • Classification: neurons equal number of classes
  • Regression: typically one neuron

Deep Neural Networks

When the number of hidden layers increases, the network becomes a Deep Neural Network (DNN):

code
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Hidden Layer N → Output Layer

Deep networks can learn more complex feature hierarchies:

  • Shallow layers: Learn simple features (edges, textures)
  • Middle layers: Learn combined features (shapes, parts)
  • Deep layers: Learn abstract features (objects, concepts)

Activation Functions Explained

Activation functions introduce non-linearity to neural networks, enabling them to learn complex patterns.

ReLU (Rectified Linear Unit)

ReLU is currently the most commonly used activation function:

python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # element-wise, so it also works on NumPy arrays
code
f(x) = max(0, x)

Advantages:

  • Simple and efficient computation
  • Mitigates vanishing gradient problem
  • Sparse activation improves efficiency

Disadvantages:

  • Dead ReLU problem (neurons permanently inactive)
  • Non-zero centered output

Sigmoid

Sigmoid compresses output to the (0, 1) range:

python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
code
f(x) = 1 / (1 + e^(-x))

Advantages:

  • Bounded output range
  • Suitable for binary classification output

Disadvantages:

  • Vanishing gradient problem
  • Non-zero centered output
  • Relatively complex computation

Tanh (Hyperbolic Tangent)

Tanh compresses output to the (-1, 1) range:

python
def tanh(x):
    return np.tanh(x)
code
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Advantages:

  • Zero-centered output
  • Stronger gradients than Sigmoid

Disadvantages:

  • Still suffers from vanishing gradient

Activation Function Selection Guide

Scenario                          Recommended Activation
--------------------------------------------------------
Hidden layers (default)           ReLU or variants
Binary classification output      Sigmoid
Multi-class output                Softmax
RNN hidden layers                 Tanh
Preventing Dead ReLU              Leaky ReLU, ELU
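For reference, the Leaky ReLU and ELU variants mentioned above differ from plain ReLU only in how they treat negative inputs; a small sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs keep a small slope instead of being zeroed out
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

out = leaky_relu(np.array([-2.0, 3.0]))  # -> [-0.02, 3.0]
```

Because the negative side still has a non-zero slope, a neuron that outputs negative values continues to receive gradient, which is why these variants avoid the Dead ReLU problem.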

Forward and Backpropagation

Forward Propagation

Forward propagation is the process of data flowing from input to output layer:

python
def softmax(z):
    # Assumed helper: row-wise softmax for the output layer
    z = z - np.max(z, axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def forward_propagation(X, weights, biases):
    """
    Simple forward propagation (uses relu defined above)
    """
    activations = [X]
    
    for i in range(len(weights)):
        # Weighted sum for this layer: z = aW + b
        z = np.dot(activations[-1], weights[i]) + biases[i]
        
        if i < len(weights) - 1:
            a = relu(z)     # hidden layers use ReLU
        else:
            a = softmax(z)  # output layer produces class probabilities
        
        activations.append(a)
    
    return activations

Forward propagation flow:

  1. Input data enters input layer
  2. Compute weighted sum: z = Wx + b
  3. Apply activation function: a = f(z)
  4. Pass to next layer
  5. Repeat until output layer
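The five steps above can be traced with a quick shape check on a hypothetical 3-4-2 network (random weights, batch of 5 samples; the shapes here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 input features -> 4 hidden units -> 2 output classes
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

X = rng.normal(size=(5, 3))         # batch of 5 samples
h = np.maximum(0, X @ W1 + b1)      # hidden layer: z = XW + b, then a = ReLU(z)
z = h @ W2 + b2                     # output logits
p = np.exp(z - z.max(axis=1, keepdims=True))
p = p / p.sum(axis=1, keepdims=True)  # softmax: each row becomes a distribution

# p has shape (5, 2) and each row sums to 1
```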

Backpropagation

Backpropagation computes gradients and updates weights based on the chain rule:

mermaid
graph RL
    subgraph "Backpropagation Flow"
        L[Loss Function] --> |∂L/∂y| O[Output Layer]
        O --> |∂L/∂h| H[Hidden Layer]
        H --> |∂L/∂x| I[Input Layer]
    end

python
def relu_derivative(a):
    # Assumed helper: 1 where the ReLU activation is positive, else 0
    return (a > 0).astype(float)

def backpropagation(y_true, activations, weights):
    """
    Backpropagation gradient computation (softmax output + cross-entropy loss)
    """
    m = y_true.shape[0]
    gradients_w = []
    gradients_b = []
    
    # For softmax + cross-entropy, the output-layer error is simply y_hat - y
    delta = activations[-1] - y_true
    
    for i in range(len(weights) - 1, -1, -1):
        dW = np.dot(activations[i].T, delta) / m
        db = np.sum(delta, axis=0) / m
        
        gradients_w.insert(0, dW)
        gradients_b.insert(0, db)
        
        if i > 0:
            # Propagate the error to the previous layer via the chain rule
            delta = np.dot(delta, weights[i].T) * relu_derivative(activations[i])
    
    return gradients_w, gradients_b

Gradient Descent Optimization

Update weights using computed gradients:

python
def gradient_descent(weights, biases, grad_w, grad_b, learning_rate):
    """
    Gradient descent parameter update
    """
    for i in range(len(weights)):
        weights[i] -= learning_rate * grad_w[i]
        biases[i] -= learning_rate * grad_b[i]
    
    return weights, biases
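
Putting the three pieces together — forward pass, backward pass, parameter update — the following is a minimal end-to-end sketch that fits a tiny one-hidden-layer network to XOR (it uses a sigmoid output with the binary cross-entropy gradient for brevity, rather than the softmax setup above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny dataset: XOR, which no purely linear model can solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 tanh units, one sigmoid output
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass: for sigmoid + binary cross-entropy,
    # the output-layer error is simply y_hat - y
    delta2 = (y_hat - y) / len(X)
    delta1 = (delta2 @ W2.T) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2

    # Gradient descent update
    W2 -= lr * (h.T @ delta2); b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * (X.T @ delta1); b1 -= lr * delta1.sum(axis=0)

predictions = (y_hat > 0.5).astype(float)  # should match y after training
```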

Loss Functions and Optimizers

Common Loss Functions

Mean Squared Error (MSE) - Regression tasks

python
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

Cross-Entropy Loss - Classification tasks

python
def cross_entropy_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred))

SGD (Stochastic Gradient Descent)

python
w = w - learning_rate * gradient

Momentum

python
v = momentum * v - learning_rate * gradient
w = w + v

Adam (Adaptive Moment Estimation)

python
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient ** 2
w = w - learning_rate * m / (np.sqrt(v) + epsilon)
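
Note that the sketch above omits Adam's bias-correction terms, which matter in the early steps when m and v are still close to zero. A more complete single-parameter update, following the original formulation, might look like:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# With bias correction, the very first step has magnitude roughly lr
w, m, v = adam_step(np.array([1.0]), np.array([0.5]), np.zeros(1), np.zeros(1), t=1)
```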

Optimizer    Characteristics           Use Cases
--------------------------------------------------------
SGD          Simple and stable         Large-scale data
Momentum     Accelerates convergence   Local optima present
Adam         Adaptive learning rate    Default choice
AdamW        Weight decay              Large model training

Common Network Architectures

Convolutional Neural Networks (CNN)

CNNs are designed for processing grid-like data such as images:

mermaid
graph LR
    subgraph "CNN Architecture"
        I[Input Image] --> C1[Conv Layer]
        C1 --> P1[Pooling]
        P1 --> C2[Conv Layer]
        C2 --> P2[Pooling]
        P2 --> F[Flatten]
        F --> FC[Fully Connected]
        FC --> O[Output]
    end

PyTorch Implementation:

python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),  # 8x8 feature maps, assuming 32x32 inputs (e.g., CIFAR-10)
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Applications: Image classification, object detection, face recognition
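The `64 * 8 * 8` in the classifier assumes 32×32 inputs such as CIFAR-10: each `padding=1` convolution preserves spatial size, while each `MaxPool2d(2, 2)` halves it. A small helper (hypothetical, for illustration) makes the arithmetic explicit:

```python
def feature_map_size(input_size, num_pool_layers, pool_factor=2):
    """Spatial size after stacked size-preserving convs and 2x2 poolings."""
    size = input_size
    for _ in range(num_pool_layers):
        size //= pool_factor  # each pooling layer halves height and width
    return size

# 32x32 input, two pooling layers: 32 -> 16 -> 8
# flattened features = 64 channels * 8 * 8 = 4096
flat = 64 * feature_map_size(32, 2) ** 2
```

For a different input size (say 28×28 MNIST), the `Linear` layer's input dimension must be recomputed the same way.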

Recurrent Neural Networks (RNN)

RNNs are designed for processing sequential data:

mermaid
graph LR
    subgraph "RNN Unrolled"
        X1[x₁] --> H1[h₁]
        H1 --> Y1[y₁]
        H1 --> H2[h₂]
        X2[x₂] --> H2
        H2 --> Y2[y₂]
        H2 --> H3[h₃]
        X3[x₃] --> H3
        H3 --> Y3[y₃]
    end

PyTorch Implementation:

python
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)  # initial hidden state
        out, _ = self.rnn(x, h0)
        out = self.fc(out[:, -1, :])
        return out

LSTM Variant: Solves long-term dependency problems

python
self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

Applications: Text generation, speech recognition, time series prediction
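Under the hood, the recurrence that `nn.RNN` computes at each time step is just h_t = tanh(x_t·Wx + h_{t-1}·Wh + b). A hand-rolled NumPy sketch of one sequence (the shapes here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, seq_len = 3, 5, 4
Wx = rng.normal(size=(input_size, hidden_size))   # input-to-hidden weights
Wh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))
h = np.zeros(hidden_size)  # initial hidden state

states = []
for x_t in x_seq:
    # The hidden state carries information from earlier steps forward
    h = np.tanh(x_t @ Wx + h @ Wh + b)
    states.append(h)

# states[-1] summarizes the whole sequence, like out[:, -1, :] in the model above
```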

Transformer

Transformer is based on self-attention mechanism and forms the foundation of modern large models:

mermaid
graph TB
    subgraph "Transformer Encoder"
        I[Input Embedding] --> PE[Positional Encoding]
        PE --> SA[Self-Attention]
        SA --> AN1["Add & Norm"]
        AN1 --> FF[Feed Forward]
        FF --> AN2["Add & Norm"]
    end

PyTorch Implementation:

python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # expects (batch, seq, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

Applications: GPT, BERT, machine translation, text generation
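The core of the self-attention layer is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. A minimal single-head NumPy sketch (token count and dimensions chosen arbitrarily):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                           # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, d_model = 8
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
```

In practice the inputs are first projected through learned Q/K/V weight matrices and split across multiple heads, which is what `nn.MultiheadAttention` handles internally.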

TensorFlow Practice Examples

Building a Simple Neural Network

python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Assumes x_train of shape (N, 784) and integer class labels y_train, e.g., flattened MNIST
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

Building a CNN

python
cnn_model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

Preventing Overfitting

Overfitting is a common problem in neural network training where models perform well on training data but poorly on test data.

Common Regularization Techniques

Dropout: Randomly drop neurons

python
nn.Dropout(p=0.5)

L2 Regularization: Limit weight magnitude

python
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)

Early Stopping: Monitor validation loss

python
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

Data Augmentation: Expand training data

python
transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2)
])


Summary

Key points about neural networks:

  1. Artificial Neurons: Simulate biological neurons, processing information through weighted sums and activation functions
  2. Network Structure: Input layer receives data, hidden layers extract features, output layer produces predictions
  3. Activation Functions: ReLU is the default choice, Sigmoid for binary classification output
  4. Forward Propagation: The computational process of data flowing from input to output
  5. Backpropagation: Compute gradients based on chain rule to update network weights
  6. Common Architectures: CNN for images, RNN for sequences, Transformer as the foundation of modern large models

Mastering neural network fundamentals is the first step into the deep learning field, building a solid foundation for learning more complex model architectures and applications.

FAQ

What is the difference between neural networks and deep learning?

Deep learning is the branch of machine learning built on neural networks with many hidden layers. Traditional neural networks may have only one or two hidden layers, while deep learning models typically have dozens or even hundreds of layers. Deep networks can learn more complex feature hierarchies and have achieved breakthrough progress in image, speech, and natural language processing.

Why do neural networks need activation functions?

Without activation functions, regardless of how many layers a network has, the entire network would be equivalent to a single linear transformation, unable to learn complex non-linear patterns. Activation functions introduce non-linearity, enabling neural networks to approximate arbitrarily complex functions, which is key to their powerful expressive capability.

How do I choose the number of layers and neurons in a neural network?

This is a hyperparameter tuning problem that requires experimentation. General principles: 1) Start with a simple network and gradually increase complexity; 2) Hidden layer neuron count is typically between input and output dimensions; 3) Use validation sets to evaluate different configurations; 4) Be aware of overfitting risks—complex networks need more data and regularization.

What are vanishing and exploding gradients?

Vanishing gradients occur when gradients become progressively smaller during backpropagation, causing shallow layer weights to barely update. Exploding gradients are the opposite, where gradients grow layer by layer causing numerical overflow. Solutions include: using ReLU activation, batch normalization, residual connections, proper weight initialization, and gradient clipping.
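Gradient clipping, mentioned above, caps the overall gradient norm before the update is applied — the same idea behind PyTorch's `torch.nn.utils.clip_grad_norm_`. A NumPy sketch of clipping by global norm:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # direction preserved, magnitude capped
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([0.0])]  # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, 1.0)  # rescaled so the norm becomes 1
```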

What tasks are CNN and RNN suitable for respectively?

CNNs excel at processing data with spatial structure, such as images and videos, because convolution operations effectively extract local features while maintaining translation invariance. RNNs excel at processing sequential data, such as text, speech, and time series, because recurrent structures can remember historical information. Today, Transformers have surpassed traditional CNNs and RNNs in many tasks.

How do I determine if a model is overfitting?

Typical signs of overfitting include training loss continuing to decrease while validation loss starts increasing, or training accuracy being much higher than validation accuracy. You can diagnose by plotting learning curves: if there's a large gap between training and validation curves, overfitting exists. Solutions include adding more data, using regularization, and reducing model complexity.