Classification and technical indicators of large models

An in-depth look at how large models are classified and the technical indicators used to describe them, to help you follow the latest progress in the field of AI.
Core content:
1. How large models are classified: by application field, model architecture, training paradigm, and functional type
2. Core technical indicators of large models, such as parameter count, model architecture details, training data, and computing resources
3. Performance evaluation metrics, including task performance, general metrics, and benchmark data
I. Classification of Large Models
1. By application field
General large models: suited to a wide range of tasks (e.g., the GPT series, PaLM), with cross-domain language understanding and generation capabilities.
Vertical large models: optimized for specific fields such as medicine, finance, and law (e.g., Baichuan Intelligence's medical model).
Multimodal large models: integrate multiple input modalities such as text, images, and speech (e.g., the multimodal version of DeepSeek).
2. By model architecture
Dense models: all parameters participate in every forward pass (e.g., GPT-3, BERT).
Sparse models: e.g., Mixture-of-Experts (MoE) models, which improve efficiency by dynamically activating only a subset of parameters per input (e.g., DeepSeek, Kimi); a minimal routing sketch follows this list.
Retrieval-Augmented Generation (RAG): combines retrieval and generation modules to improve factual accuracy and freshness (e.g., the ChatPDF system).
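To make the idea of dynamic activation concrete, here is a minimal NumPy sketch of MoE-style routing: a gate scores all experts for each token, and only the top-k expert feed-forward blocks are actually run. The expert count, layer sizes, and top-k value are illustrative assumptions, not the configuration of any particular model.

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts.

    x        : (n_tokens, d_model) token representations
    experts  : list of (w1, w2) weight pairs, one ReLU MLP per expert
    gate_w   : (d_model, n_experts) gating weights
    """
    logits = x @ gate_w                                   # gating score per expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # top-k expert indices per token
    sel = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(sel) / np.exp(sel).sum(-1, keepdims=True)  # softmax over selected experts

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # per-token routing (clarity over speed)
        for j, e in enumerate(top[t]):
            w1, w2 = experts[e]
            out[t] += weights[t, j] * (np.maximum(x[t] @ w1, 0) @ w2)  # run only the chosen experts
    return out

# Tiny demo: 4 experts exist, but only 2 are active per token.
rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [(rng.normal(size=(d, 16)), rng.normal(size=(16, d))) for _ in range(n_exp)]
gate_w = rng.normal(size=(d, n_exp))
tokens = rng.normal(size=(5, d))
print(moe_layer(tokens, experts, gate_w).shape)           # (5, 8)
```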
3. By training paradigm
Pre-training + fine-tuning: e.g., BERT, which is pre-trained at large scale and then fine-tuned for specific tasks.
Prompt-based learning: models are driven by natural-language instructions without explicit fine-tuning (e.g., GPT-3, ChatGPT); a minimal usage sketch follows this list.
Reinforcement learning from human feedback (RLHF): uses human preference feedback to align generated content (e.g., InstructGPT, DeepSeek).
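For contrast with fine-tuning, the sketch below drives a pre-trained model with nothing but a natural-language prompt, using the Hugging Face transformers pipeline. The tiny gpt2 checkpoint is only a stand-in so the example stays runnable; it follows instructions far less reliably than GPT-3-class models, but the mechanics (no gradient updates, only instructions and examples in the input) are the same.

```python
# Prompt-based use of a pre-trained model: the "task specification" lives entirely in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small stand-in checkpoint

prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"
result = generator(prompt, max_new_tokens=10, do_sample=False)
print(result[0]["generated_text"])
```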
4. By functional type
Generative models: focus on text generation (e.g., GPT, PaLM).
Understanding models: focus on semantic analysis and classification (e.g., BERT).
Reasoning models: capable of complex logical reasoning (e.g., DeepSeek, optimized with long chains of thought).
II. Core Technical Indicators of Large Models
1. Model scale related indicators
Parameters
The total number of trainable parameters in the model, usually quoted in hundreds of millions (100M), billions (B), tens of billions (10B), hundreds of billions (100B), or trillions (T). For example: GPT-3 (175B), PaLM-2 (340B), Llama 2 (7B-70B). More parameters mean greater model capacity, but also higher training and inference costs.
Model architecture details (a rough parameter-count estimate from these quantities is sketched after this list):
Layers: the number of Transformer layers (e.g., 12, 24, 96).
Attention heads: the number of heads in each layer's multi-head attention mechanism (e.g., 16, 32).
Hidden dimension: the width of each layer's hidden representation (e.g., 1024, 4096).
Embedding size: the dimension of the input token embeddings.
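As a rough cross-check of how these quantities determine parameter count, the sketch below uses the common approximation of about 12 × L × d² parameters for the Transformer blocks (4·d² for the attention projections plus 8·d² for the feed-forward network) plus the embedding matrix. Exact counts vary with implementation details such as tied embeddings, biases, and FFN width; the GPT-3-like configuration is an assumption used only to show the arithmetic.

```python
def approx_params(n_layers, d_model, vocab_size, ffn_mult=4):
    """Very rough Transformer parameter estimate.

    Per layer: ~4*d^2 for attention projections (Q, K, V, output)
             + ~2*ffn_mult*d^2 for the feed-forward block.
    Plus the token embedding matrix (vocab_size * d_model).
    """
    per_layer = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_layers * per_layer + vocab_size * d_model

# GPT-3-like configuration: 96 layers, hidden size 12288, ~50k vocabulary.
print(f"{approx_params(96, 12288, 50257) / 1e9:.0f}B parameters")  # ~175B
```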
2. Training data and computing resources
Amount of training data
The scale of pre-training data is usually measured in tokens (e.g., 1T tokens) or raw data size (e.g., terabyte scale).
The diversity and quality of data sources (e.g., multi-language, multi-domain) are also key.
Computing resource consumption
Training time: the total time to complete training on a GPU/TPU cluster (e.g., thousands of hours).
Computing power requirements: usually expressed in FLOPs (floating-point operations); for example, training GPT-3 required roughly 3.14e23 FLOPs (a rule-of-thumb estimate is sketched at the end of this subsection).
Hardware scale: the number of GPUs/TPUs used (e.g., thousands of chips).
Training cost
Power consumption plus hardware rental or purchase costs (e.g., millions of dollars).
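A widely used rule of thumb is that training a dense Transformer costs roughly 6 × N × D FLOPs, where N is the parameter count and D the number of training tokens. The sketch below reproduces the GPT-3 figure quoted above, assuming the commonly cited 175B parameters and about 300B training tokens.

```python
def train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters, ~300B training tokens (commonly cited figures).
print(f"{train_flops(175e9, 300e9):.2e} FLOPs")  # ~3.15e+23, matching the ~3.14e23 quoted above
```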
3. Performance Evaluation Metrics
Task performance
General metrics:
Perplexity: measures a language model's ability to predict text (lower is better); a small computation sketch follows this list.
Accuracy, F1 score: used for classification or generation tasks.
Domain-specific metrics:
BLEU (machine translation), ROUGE (text summarization), GLUE/SuperGLUE (natural language understanding benchmarks).
Few-shot/zero-shot learning: the model's ability to generalize from only a few examples, or none at all.
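For concreteness, perplexity is just the exponential of the average per-token negative log-likelihood. The sketch below computes it from a list of probabilities that a model assigned to the observed tokens; the numbers are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Probabilities a model assigned to the actual next tokens:
print(perplexity([0.25, 0.5, 0.1, 0.4]))   # ~3.76 (lower is better)
print(perplexity([0.9, 0.8, 0.95, 0.85]))  # ~1.15 (a much better model)
```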
Inference efficiency
Latency: the time required for a single inference (e.g., in milliseconds).
Throughput: the number of requests processed per unit time (e.g., 100 requests per second).
GPU memory usage: the GPU memory (VRAM) required during inference (e.g., 10 GB); a rough weight-memory estimate is sketched after this list.
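A quick way to sanity-check GPU memory figures: the weights alone need roughly parameter count × bytes per parameter. The sketch below estimates this for an assumed 7B-parameter model at several precisions; activations and the KV cache come on top and are not included.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

n = 7e9  # an assumed 7B-parameter model, roughly Llama-2-7B-sized
for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name:>10}: ~{weight_memory_gb(n, nbytes):.1f} GiB")
# FP16 comes out around 13 GiB, which is why ~10-16 GB of VRAM is a typical
# inference requirement for models of this size (before KV cache and activations).
```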
4. Energy consumption and deployment indicators
Energy efficiency ratio
The ratio of performance (e.g., tokens processed per second) to power consumption (in watts); particularly important for edge deployment.
Model compression and optimization
Quantization: the impact of reducing parameter precision (e.g., FP32→INT8) on accuracy and memory; a minimal sketch follows this list.
Pruning: reduces model size and improves speed by removing redundant parameters.
Distillation: how well a smaller model inherits knowledge from a larger one.
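As a minimal illustration of what FP32→INT8 quantization does, the sketch below applies symmetric per-tensor quantization to a randomly generated weight matrix and reports the memory saving and rounding error. Real frameworks use finer-grained (per-channel or per-group) schemes; this is only a toy example.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats to the range [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # a made-up weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("memory: %.1f MB -> %.1f MB" % (w.nbytes / 1e6, q.nbytes / 1e6))   # 4x smaller
print("mean abs rounding error: %.6f" % np.abs(w - w_hat).mean())        # small but nonzero
```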
5. Other key indicators
Robustness
Resistance to adversarial examples and input noise.
Stability across multilingual and multi-domain tasks.
Fairness and safety
Bias: the degree of bias in model outputs, e.g., with respect to gender or race.
Toxicity: the probability of generating harmful content.
Interpretability: traceability of model decisions (e.g., attention visualization).
Ecosystem Support
Compatibility with open-source frameworks (e.g., Hugging Face and PyTorch); a minimal model-loading sketch follows this list.
Availability of community toolchains and pre-trained models.
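In practice, ecosystem support means a published model can be pulled and run in a few lines. Here is a minimal sketch using the Hugging Face transformers API; the gpt2 checkpoint is just a small, freely downloadable stand-in, and any causal-LM checkpoint on the Hub loads the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal-LM checkpoint on the Hugging Face Hub works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Large models are classified by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```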
Typical large-model indicator examples
| Model | Parameters | Training data | Training compute (FLOPs) | Hardware scale | Typical task performance (MMLU accuracy) |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | ~1.8T* | ~13T tokens | ~2e25 | 25,000+ GPUs | 86.4% |
| PaLM-2 | 340B | 3.6T tokens | ~3e24 | TPU v4 Pod | 85.4% |
| Llama 2-70B | 70B | 2T tokens | ~3e23 | 3,000+ GPUs | 68.9% |
III. Typical Assessment Benchmarks and Tools
General Ability Assessment
MMLU: tests multi-task language understanding; an accuracy-scoring sketch follows this list.
HellaSwag: assesses commonsense reasoning and sentence completion.
TruthfulQA: measures the truthfulness of generated content.
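At their core, benchmarks such as MMLU and HellaSwag reduce to multiple-choice accuracy over many questions. The scoring sketch below is generic: the ask_model callable and the two sample items are placeholders, not the real benchmark data or any specific evaluation API.

```python
def evaluate(items, ask_model):
    """items: list of {"question", "choices", "answer"} dicts.
    ask_model: callable returning the index of the chosen option (placeholder for a real model call)."""
    correct = sum(ask_model(it["question"], it["choices"]) == it["answer"] for it in items)
    return correct / len(items)

# Two toy multiple-choice items (illustrative, not taken from any actual benchmark).
items = [
    {"question": "2 + 2 * 3 = ?", "choices": ["8", "12", "10", "6"], "answer": 0},
    {"question": "At sea level, water boils at ...", "choices": ["90C", "100C", "110C", "120C"], "answer": 1},
]

dummy_model = lambda q, choices: 0  # always picks the first option
print(f"accuracy: {evaluate(items, dummy_model):.0%}")  # 50% for this dummy model
```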
Industry-specific evaluations
Medical: focuses on diagnostic accuracy and the compliance of medication recommendations.
Financial: focuses on logical reasoning and numerical calculation (e.g., financial report analysis).
Open Source Tools
SuperCLUE: a comprehensive evaluation benchmark for Chinese large models.
RAGAS: evaluates contextual relevance for retrieval-augmented generation systems.
IV. Future Trends and Challenges
Efficient architecture innovation: e.g., MoE models and sparsification techniques to reduce compute requirements.
Synthetic data optimization: generating high-quality training data with the model itself (e.g., DeepSeek's long chain-of-thought strategy).
Enhanced interpretability: combining attention visualization with natural-language explanations to improve model transparency.
Multimodal fusion: moving toward unified modeling of text, images, and video (e.g., GPT-4V).