TL;DR

Neural networks are computational models inspired by the human brain's neuron connections, forming the foundation of modern deep learning and artificial intelligence. This guide covers biological vs artificial neurons, network architecture (input, hidden, output layers), activation functions (ReLU, Sigmoid, Tanh), forward and backpropagation algorithms, loss functions and optimizers, and mainstream architectures like CNN, RNN, and Transformer, with practical PyTorch and TensorFlow code examples.

Introduction

Neural networks are among the most important technologies in artificial intelligence. From image recognition to natural language processing, from autonomous driving to medical diagnosis, neural networks are transforming our world. Understanding how neural networks work not only helps you better utilize AI tools but also builds a solid foundation for deeper exploration of deep learning.

In this guide, you will learn:

  • Comparison between biological and artificial neurons
  • Basic structure and mathematical principles of neural networks
  • Characteristics and selection criteria for common activation functions
  • How forward propagation and backpropagation work
  • The role of loss functions and optimizers
  • Mainstream architectures: CNN, RNN, Transformer
  • Practical code with PyTorch and TensorFlow

From Biological to Artificial Neurons

How Biological Neurons Work

The human brain contains approximately 86 billion neurons, each connected to others through synapses, forming complex neural networks. The basic workflow of biological neurons:

  1. Dendrites receive signals from other neurons
  2. Cell body integrates and processes signals
  3. When signal strength exceeds a threshold, the neuron activates
  4. Axon transmits signals to the next neuron

mermaid
graph LR
    subgraph "Biological Neuron"
        D1[Dendrite 1] --> CB[Cell Body]
        D2[Dendrite 2] --> CB
        D3[Dendrite 3] --> CB
        CB --> |Threshold Exceeded| A[Axon]
        A --> S[Synapse]
    end

Mathematical Model of Artificial Neurons

Artificial neurons (also called perceptrons) simulate how biological neurons work:

mermaid
graph LR
    subgraph "Artificial Neuron"
        X1[x₁] --> |w₁| SUM((Σ))
        X2[x₂] --> |w₂| SUM
        X3[x₃] --> |w₃| SUM
        B[Bias b] --> SUM
        SUM --> ACT[Activation f]
        ACT --> Y[Output y]
    end

Mathematical expression:

code
y = f(w₁x₁ + w₂x₂ + w₃x₃ + b) = f(Σwᵢxᵢ + b)

Where:

  • xᵢ: Input features
  • wᵢ: Weights, representing the importance of each input
  • b: Bias, adjusting the activation threshold
  • f: Activation function, introducing non-linearity
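As a sketch, this weighted-sum-plus-activation computation takes only a few lines of NumPy (the inputs, weights, and bias below are arbitrary illustrative values, and the threshold activation is just one possible choice of f):

```python
import numpy as np

def neuron(x, w, b, f):
    """Single artificial neuron: y = f(w·x + b)."""
    return f(np.dot(w, x) + b)

# Illustrative values: three inputs, arbitrary weights and bias
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 0.05
step = lambda z: 1.0 if z > 0 else 0.0  # simple threshold activation

y = neuron(x, w, b, step)  # w·x + b = 0.5 - 0.4 + 0.3 + 0.05 = 0.45 > 0, so y = 1.0
```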

Biological vs Artificial Neurons

Property              Biological Neuron      Artificial Neuron
--------------------------------------------------------------------------
Signal Type           Electrochemical        Numerical
Connection            Synapse                Weight
Activation            Threshold Trigger      Activation Function
Learning              Synaptic Plasticity    Gradient Descent
Processing Speed      Milliseconds           Nanoseconds
Power Consumption     ~20 Watts              Hundreds to Thousands of Watts

Basic Neural Network Structure

Neural networks consist of multiple layers of neurons, with each layer fully connected to adjacent layers.

Three-Layer Basic Architecture

mermaid
graph LR
    subgraph "Input Layer"
        I1((x₁))
        I2((x₂))
        I3((x₃))
    end
    subgraph "Hidden Layer"
        H1((h₁))
        H2((h₂))
        H3((h₃))
        H4((h₄))
    end
    subgraph "Output Layer"
        O1((y₁))
        O2((y₂))
    end
    I1 --> H1
    I1 --> H2
    I1 --> H3
    I1 --> H4
    I2 --> H1
    I2 --> H2
    I2 --> H3
    I2 --> H4
    I3 --> H1
    I3 --> H2
    I3 --> H3
    I3 --> H4
    H1 --> O1
    H1 --> O2
    H2 --> O1
    H2 --> O2
    H3 --> O1
    H3 --> O2
    H4 --> O1
    H4 --> O2

Role of Each Layer

Input Layer

  • Receives raw data
  • Number of neurons equals feature dimensions
  • No computation, only data passing

Hidden Layer

  • Extracts and transforms features
  • Can have multiple layers (deep learning)
  • Number of neurons is a hyperparameter

Output Layer

  • Produces final predictions
  • Classification: neurons equal number of classes
  • Regression: typically one neuron

Deep Neural Networks

When the number of hidden layers increases, the network becomes a Deep Neural Network (DNN):

code
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Hidden Layer N → Output Layer

Deep networks can learn more complex feature hierarchies:

  • Shallow layers: Learn simple features (edges, textures)
  • Middle layers: Learn combined features (shapes, parts)
  • Deep layers: Learn abstract features (objects, concepts)

Activation Functions Explained

Activation functions introduce non-linearity to neural networks, enabling them to learn complex patterns.

ReLU (Rectified Linear Unit)

ReLU is currently the most commonly used activation function:

python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # element-wise, so it also works on NumPy arrays
code
f(x) = max(0, x)

Advantages:

  • Simple and efficient computation
  • Mitigates vanishing gradient problem
  • Sparse activation improves efficiency

Disadvantages:

  • Dead ReLU problem (neurons permanently inactive)
  • Non-zero centered output

Sigmoid

Sigmoid compresses output to the (0, 1) range:

python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
code
f(x) = 1 / (1 + e^(-x))

Advantages:

  • Bounded output range
  • Suitable for binary classification output

Disadvantages:

  • Vanishing gradient problem
  • Non-zero centered output
  • Relatively complex computation

Tanh (Hyperbolic Tangent)

Tanh compresses output to the (-1, 1) range:

python
def tanh(x):
    return np.tanh(x)
code
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Advantages:

  • Zero-centered output
  • Stronger gradients than Sigmoid

Disadvantages:

  • Still suffers from vanishing gradient

Activation Function Selection Guide

Scenario                          Recommended Activation
--------------------------------------------------------
Hidden layers (default)           ReLU or variants
Binary classification output      Sigmoid
Multi-class output                Softmax
RNN hidden layers                 Tanh
Preventing Dead ReLU              Leaky ReLU, ELU
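For reference, the Leaky ReLU and ELU variants mentioned above differ from plain ReLU only in how they treat negative inputs; a small sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs keep a small slope instead of being zeroed out
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

out = leaky_relu(np.array([-2.0, 3.0]))  # -> [-0.02, 3.0]
```

Because the negative side still has a non-zero slope, a neuron that outputs negative values continues to receive gradient, which is why these variants avoid the Dead ReLU problem.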

Forward and Backpropagation

Forward Propagation

Forward propagation is the process of data flowing from input to output layer:

python
def softmax(z):
    # Assumed helper: row-wise softmax for the output layer
    z = z - np.max(z, axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def forward_propagation(X, weights, biases):
    """
    Simple forward propagation (uses relu defined above)
    """
    activations = [X]
    
    for i in range(len(weights)):
        # Weighted sum for this layer: z = aW + b
        z = np.dot(activations[-1], weights[i]) + biases[i]
        
        if i < len(weights) - 1:
            a = relu(z)     # hidden layers use ReLU
        else:
            a = softmax(z)  # output layer produces class probabilities
        
        activations.append(a)
    
    return activations

Forward propagation flow:

  1. Input data enters input layer
  2. Compute weighted sum: z = Wx + b
  3. Apply activation function: a = f(z)
  4. Pass to next layer
  5. Repeat until output layer
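The five steps above can be traced with a quick shape check on a hypothetical 3-4-2 network (random weights, batch of 5 samples; the shapes here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 input features -> 4 hidden units -> 2 output classes
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

X = rng.normal(size=(5, 3))         # batch of 5 samples
h = np.maximum(0, X @ W1 + b1)      # hidden layer: z = XW + b, then a = ReLU(z)
z = h @ W2 + b2                     # output logits
p = np.exp(z - z.max(axis=1, keepdims=True))
p = p / p.sum(axis=1, keepdims=True)  # softmax: each row becomes a distribution

# p has shape (5, 2) and each row sums to 1
```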

Backpropagation

Backpropagation computes gradients and updates weights based on the chain rule:

mermaid
graph RL
    subgraph "Backpropagation Flow"
        L[Loss Function] --> |∂L/∂y| O[Output Layer]
        O --> |∂L/∂h| H[Hidden Layer]
        H --> |∂L/∂x| I[Input Layer]
    end

python
def relu_derivative(a):
    # Assumed helper: 1 where the ReLU activation is positive, else 0
    return (a > 0).astype(float)

def backpropagation(y_true, activations, weights):
    """
    Backpropagation gradient computation (softmax output + cross-entropy loss)
    """
    m = y_true.shape[0]
    gradients_w = []
    gradients_b = []
    
    # For softmax + cross-entropy, the output-layer error is simply y_hat - y
    delta = activations[-1] - y_true
    
    for i in range(len(weights) - 1, -1, -1):
        dW = np.dot(activations[i].T, delta) / m
        db = np.sum(delta, axis=0) / m
        
        gradients_w.insert(0, dW)
        gradients_b.insert(0, db)
        
        if i > 0:
            # Propagate the error to the previous layer via the chain rule
            delta = np.dot(delta, weights[i].T) * relu_derivative(activations[i])
    
    return gradients_w, gradients_b

Gradient Descent Optimization

Update weights using computed gradients:

python
def gradient_descent(weights, biases, grad_w, grad_b, learning_rate):
    """
    Gradient descent parameter update
    """
    for i in range(len(weights)):
        weights[i] -= learning_rate * grad_w[i]
        biases[i] -= learning_rate * grad_b[i]
    
    return weights, biases
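
Putting the three pieces together — forward pass, backward pass, parameter update — the following is a minimal end-to-end sketch that fits a tiny one-hidden-layer network to XOR (it uses a sigmoid output with the binary cross-entropy gradient for brevity, rather than the softmax setup above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny dataset: XOR, which no purely linear model can solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 tanh units, one sigmoid output
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass: for sigmoid + binary cross-entropy,
    # the output-layer error is simply y_hat - y
    delta2 = (y_hat - y) / len(X)
    delta1 = (delta2 @ W2.T) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2

    # Gradient descent update
    W2 -= lr * (h.T @ delta2); b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * (X.T @ delta1); b1 -= lr * delta1.sum(axis=0)

predictions = (y_hat > 0.5).astype(float)  # should match y after training
```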

Loss Functions and Optimizers

Common Loss Functions

Mean Squared Error (MSE) - Regression tasks

python
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

Cross-Entropy Loss - Classification tasks

python
def cross_entropy_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred))

SGD (Stochastic Gradient Descent)

python
w = w - learning_rate * gradient

Momentum

python
v = momentum * v - learning_rate * gradient
w = w + v

Adam (Adaptive Moment Estimation)

python
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient ** 2
w = w - learning_rate * m / (np.sqrt(v) + epsilon)
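
Note that the sketch above omits Adam's bias-correction terms, which matter in the early steps when m and v are still close to zero. A more complete single-parameter update, following the original formulation, might look like:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# With bias correction, the very first step has magnitude roughly lr
w, m, v = adam_step(np.array([1.0]), np.array([0.5]), np.zeros(1), np.zeros(1), t=1)
```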

Optimizer    Characteristics           Use Cases
--------------------------------------------------------
SGD          Simple and stable         Large-scale data
Momentum     Accelerates convergence   Local optima present
Adam         Adaptive learning rate    Default choice
AdamW        Weight decay              Large model training

Common Network Architectures

Convolutional Neural Networks (CNN)

CNNs are designed for processing grid-like data such as images:

mermaid
graph LR
    subgraph "CNN Architecture"
        I[Input Image] --> C1[Conv Layer]
        C1 --> P1[Pooling]
        P1 --> C2[Conv Layer]
        C2 --> P2[Pooling]
        P2 --> F[Flatten]
        F --> FC[Fully Connected]
        FC --> O[Output]
    end

PyTorch Implementation:

python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),  # 8x8 feature maps, assuming 32x32 inputs (e.g., CIFAR-10)
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Applications: Image classification, object detection, face recognition
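The `64 * 8 * 8` in the classifier assumes 32×32 inputs such as CIFAR-10: each `padding=1` convolution preserves spatial size, while each `MaxPool2d(2, 2)` halves it. A small helper (hypothetical, for illustration) makes the arithmetic explicit:

```python
def feature_map_size(input_size, num_pool_layers, pool_factor=2):
    """Spatial size after stacked size-preserving convs and 2x2 poolings."""
    size = input_size
    for _ in range(num_pool_layers):
        size //= pool_factor  # each pooling layer halves height and width
    return size

# 32x32 input, two pooling layers: 32 -> 16 -> 8
# flattened features = 64 channels * 8 * 8 = 4096
flat = 64 * feature_map_size(32, 2) ** 2
```

For a different input size (say 28×28 MNIST), the `Linear` layer's input dimension must be recomputed the same way.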

Recurrent Neural Networks (RNN)

RNNs are designed for processing sequential data:

mermaid
graph LR
    subgraph "RNN Unrolled"
        X1[x₁] --> H1[h₁]
        H1 --> Y1[y₁]
        H1 --> H2[h₂]
        X2[x₂] --> H2
        H2 --> Y2[y₂]
        H2 --> H3[h₃]
        X3[x₃] --> H3
        H3 --> Y3[y₃]
    end

PyTorch Implementation:

python
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)  # initial hidden state
        out, _ = self.rnn(x, h0)
        out = self.fc(out[:, -1, :])
        return out

LSTM Variant: Solves long-term dependency problems

python
self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

Applications: Text generation, speech recognition, time series prediction
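Under the hood, the recurrence that `nn.RNN` computes at each time step is just h_t = tanh(x_t·Wx + h_{t-1}·Wh + b). A hand-rolled NumPy sketch of one sequence (the shapes here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, seq_len = 3, 5, 4
Wx = rng.normal(size=(input_size, hidden_size))   # input-to-hidden weights
Wh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))
h = np.zeros(hidden_size)  # initial hidden state

states = []
for x_t in x_seq:
    # The hidden state carries information from earlier steps forward
    h = np.tanh(x_t @ Wx + h @ Wh + b)
    states.append(h)

# states[-1] summarizes the whole sequence, like out[:, -1, :] in the model above
```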

Transformer

Transformer is based on self-attention mechanism and forms the foundation of modern large models:

mermaid
graph TB
    subgraph "Transformer Encoder"
        I[Input Embedding] --> PE[Positional Encoding]
        PE --> SA[Self-Attention]
        SA --> AN1["Add & Norm"]
        AN1 --> FF[Feed Forward]
        FF --> AN2["Add & Norm"]
    end

PyTorch Implementation:

python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # expects (batch, seq, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

Applications: GPT, BERT, machine translation, text generation
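The core of the self-attention layer is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. A minimal single-head NumPy sketch (token count and dimensions chosen arbitrarily):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                           # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, d_model = 8
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
```

In practice the inputs are first projected through learned Q/K/V weight matrices and split across multiple heads, which is what `nn.MultiheadAttention` handles internally.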

TensorFlow Practice Examples

Building a Simple Neural Network

python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Assumes x_train of shape (N, 784) and integer class labels y_train, e.g., flattened MNIST
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

Building a CNN

python
cnn_model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

Preventing Overfitting

Overfitting is a common problem in neural network training where models perform well on training data but poorly on test data.

Common Regularization Techniques

Dropout: Randomly drop neurons

python
nn.Dropout(p=0.5)

L2 Regularization: Limit weight magnitude

python
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)

Early Stopping: Monitor validation loss

python
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

Data Augmentation: Expand training data

python
transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2)
])


Summary

Key points about neural networks:

  1. Artificial Neurons: Simulate biological neurons, processing information through weighted sums and activation functions
  2. Network Structure: Input layer receives data, hidden layers extract features, output layer produces predictions
  3. Activation Functions: ReLU is the default choice, Sigmoid for binary classification output
  4. Forward Propagation: The computational process of data flowing from input to output
  5. Backpropagation: Compute gradients based on chain rule to update network weights
  6. Common Architectures: CNN for images, RNN for sequences, Transformer as the foundation of modern large models

Mastering neural network fundamentals is the first step into the deep learning field, building a solid foundation for learning more complex model architectures and applications.

FAQ

What is the difference between neural networks and deep learning?

Deep learning is the branch of machine learning built on neural networks with many hidden layers. Traditional neural networks may have only one or two hidden layers, while deep learning models typically have dozens or even hundreds of layers. Deep networks can learn more complex feature hierarchies and have achieved breakthrough progress in image, speech, and natural language processing.

Why do neural networks need activation functions?

Without activation functions, regardless of how many layers a network has, the entire network would be equivalent to a single linear transformation, unable to learn complex non-linear patterns. Activation functions introduce non-linearity, enabling neural networks to approximate arbitrarily complex functions, which is key to their powerful expressive capability.

How do I choose the number of layers and neurons in a neural network?

This is a hyperparameter tuning problem that requires experimentation. General principles: 1) Start with a simple network and gradually increase complexity; 2) Hidden layer neuron count is typically between input and output dimensions; 3) Use validation sets to evaluate different configurations; 4) Be aware of overfitting risks—complex networks need more data and regularization.

What are vanishing and exploding gradients?

Vanishing gradients occur when gradients become progressively smaller during backpropagation, causing shallow layer weights to barely update. Exploding gradients are the opposite, where gradients grow layer by layer causing numerical overflow. Solutions include: using ReLU activation, batch normalization, residual connections, proper weight initialization, and gradient clipping.
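Gradient clipping, mentioned above, caps the overall gradient norm before the update is applied — the same idea behind PyTorch's `torch.nn.utils.clip_grad_norm_`. A NumPy sketch of clipping by global norm:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # direction preserved, magnitude capped
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([0.0])]  # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, 1.0)  # rescaled so the norm becomes 1
```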

What tasks are CNN and RNN suitable for respectively?

CNNs excel at processing data with spatial structure, such as images and videos, because convolution operations effectively extract local features while maintaining translation invariance. RNNs excel at processing sequential data, such as text, speech, and time series, because recurrent structures can remember historical information. Today, Transformers have surpassed traditional CNNs and RNNs in many tasks.

How do I determine if a model is overfitting?

Typical signs of overfitting include training loss continuing to decrease while validation loss starts increasing, or training accuracy being much higher than validation accuracy. You can diagnose by plotting learning curves: if there's a large gap between training and validation curves, overfitting exists. Solutions include adding more data, using regularization, and reducing model complexity.