Woter AI detection.Hurry - ends Jun 29th

New Year Sales :up to 80% OFF

AI Humanize AI Translator Bypass AI AI Rewriter AI Detector

PRICING

TRY FOR FREE

One-click GRPO fine-tuning Qwen3 latest model on single card 4090

Written by

Silas Grey

Updated on:June-15th-2025

Model and dataset download

Because of the domestic network environment, I have a fear of connection timeout, so the first thing is to download what needs to be downloaded, and don't download dynamically during operation. The model and dataset address used in this article are:

https://modelscope.cn/models/Qwen/Qwen3-4B-Base
https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini
https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed

Download command:

modelscope download --model Qwen/Qwen3-4B-Base --revision master --local_dir /models/Qwen/Qwen3-4B-Basehuggingface-cli download --resume-download --repo-type dataset unsloth/OpenMathReasoning-mini --local-dir unsloth/OpenMathReasoning-minihuggingface-cli download --resume-download --repo-type dataset open-r1/DAPO-Math-17k-Processed --local-dir open-r1/DAPO-Math-17k-Processed

Start the container

# docker run --name unsloth0517 -itd --gpus '"device=4"' \  -v /data/ai/models:/models \  -v /data/ai/datasets:/datasets \  -v /data/ai/workspace/unsloth:/workspace \  unsloth:20250517_4cd5_cu121 bash
# docker exec -it unsloth0517 bashroot@ 1855d8235e1a:/home/unsloth # cd /workspace/scripts

The construction method of the Docker image unsloth:20250517_4cd5_cu121 is described in detail in the previous article: "Full steps for fine-tuning Qwen3-32B for a single RTX4090 card".

root@1855d8235e1a:/workspace/scripts# python unsloth-grpo-qwen3.py? Unsloth: Will patch your computer to enable 2x faster free finetuning.? Unsloth Zoo will now patch everything to make training faster! Traceback (most recent call last): File "/workspace/scripts/unsloth-grpo-qwen3.py", line 17, in <module> model, tokenizer = FastLanguageModel.from_pretrained( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/unsloth/models/loader.py", line 138, in from_pretrained raise ImportError(ImportError: Unsloth: Please install vLLM before enabling `fast_inference`! You can do this in a terminal via `pip install vllm`

Because the unsloth:20250517_4cd5_cu121 container image does not contain vllm, this error is reported.

We can create another image containing VLLM based on the unsloth:20250517_4cd5_cu121 image. For simplicity, we will install VLLM directly in the container:

root@1855d8235e1a: /workspace/s scripts # export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/root@1855d8235e1a: /workspace/s scripts # pip install vllm. . .Successfully installed airportsdata- 20250224  annotated-types- 0 . 7.0  anyio- 4.9 . 0  astor- 0 . 8.1  blake3- 1.0 . 5  cachetools- 5.5 . 2  cloudpickle- 3.1 . 1  compressed-tensors- 0 . 9.3  cupy-cuda12x- 13.4 . 1  deprecated- 1.2 . 18  depyf- 0 . 18.0  diskcache- 5.6 . 3  einops- 0 . 8.1  email-validator- 2.2 . 0  fastapi- 0 . 115.12  fastapi-cli- 0 . 0 . 7  fastrlock- 0 . 8.3  gguf- 0.16.3  googleapis-common-protos- 1.70 . 0  h11- 0 . 16.0  hf-xet- 1.1 . 2  httpcore- 1.0 . 9  httptools- 0 . 6.4  httpx- 0 . 28.1  importlib_metadata- 8.0 . 0  interegular- 0 . 3.3  jinja2- 3.1 . 6  jiter- 0 . 10.0  lark- 1.2 . 2  llguidance- 0 . 7.22  llvmlite- 0 . 44.0  lm- format -enforcer- 0 . 10.11  mistral_common- 1.5 . 5  msgpack- 1.1 . 0  nest_asyncio- 1.6 . 0  numba- 0 . 61.2  nvidia-cublas-cu12- 12.4 . 5.8  nvidia-cuda-cupti-cu12- 12.4 . 127  nvidia-cuda-nvrtc-cu12- 12.4 . 127  nvidia-cuda-runtime-cu12- 12.4 . 127  nvidia-cufft-cu12- 11.2 . 1.3  nvidia-curand-cu12- 10.3. 5.147  nvidia-cusolver-cu12- 11.6 . 1.9  nvidia-cusparse-cu12- 12.3 . 1.170  nvidia-cusparselt-cu12- 0 . 6.2  nvidia-nvjitlink-cu12- 12.4 . 127  nvidia-nvtx-cu12- 12.4 . 127  openai- 1.81 . 0  opencv-python-headless- 4.11 . 0 . 86  opentelemetry-api- 1.26 . 0  opentelemetry-exporter-otlp- 1.26 . 0  opentelemetry-exporter-otlp-proto-common- 1.26 . 0  opentelemetry-exporter-otlp-proto-grpc- 1.26 . 0  opentelemetry-exporter-otlp-proto-http- 1.26 . 0  opentelemetry-proto- 1.26 . 0  opentelemetry-sdk- 1.26 . 0  opentelemetry-semantic-conventions- 0 . 47 b 0  opentelemetry-semantic-conventions-ai- 0 . 4.9  outlines- 0 . 1.11  outlines_core- 0 . 1.26  partial-json-parser- 0 . 2.1 . 1 .post5 pillow- 11.2 . 1  prometheus-fastapi-instrumentator- 7.1 . 0  prometheus_client- 0 . 22.0  py-cpuinfo- 9.0 . 0  pycountry- 24.6 . 1  pydantic- 2.11 . 4  pydantic-core- 2.33 . 2  python-dotenv- 1.1 . 0  python-json-logger- 3.3 . 0  python-multipart- 0 . 0 . 20  pyzmq- 26.4 . 0  ray- 2.46 . 0  rich-toolkit- 0 . 14.6  scipy- 1.15 . 3  shellingham- 1.5 . 4  sniffio- 1.3 . 1  starlette- 0 . 46.2  tiktoken- 0 .9.0  torch- 2.6 . 0  torchaudio- 2.6 . 0  torchvision- 0 . 21.0  triton- 3.2 . 0  typer- 0 . 15.4  typing-inspection- 0 . 4.1  uvicorn- 0 . 34.2  uvloop- 0 . 21.0  vllm- 0 . 8.5 .post1 watchfiles- 1.0 . 5  websockets - 15.0 . 1  wrapt - 1.17 . 2  xformers - 0 . 0 . 29 .post2 xgrammar - 0 . 1.18WARNING: Running pip as the  'root'  user can result in broken permissions  and  conflicting behavior with the  system  package  manager, possibly rendering your  system  unusable. It is recommended to  use  a virtual environment instead: https: //pip .pypa.io/warnings/venv. Use the --root-user-action option  if  you know what you are doing  and  want to suppress this warning.

Start training

After entering the container, make sure that the following directories are visible in the container:

Base model directory to be trained: /models/Qwen/Qwen3-4B-Base
Dataset 1: /datasets/unsloth/OpenMathReasoning-mini/data/cot-00000-of-00001.parquet
Dataset 2:/datasets/open-r1/DAPO-Math-17k-Processed/en/train-00000-of-00001.parquet
Working directory and training code file: /workspace/scripts/unsloth-grpo-qwen3.py

Execute the following code in the /workspace/scripts/ directory of the container to start training:

cat unsloth-grpo-qwen3.py > unsloth-grpo-qwen3.py.log && \ nohup python unsloth-grpo-qwen3.py >> unsloth-grpo-qwen3.py.log 2>&1 &

We intentionally flush the training code to the front of the training log to fix it. The advantage of doing this is that it is convenient for troubleshooting during code iteration.

The latest version of the training code unsloth-grpo-qwen3.py has optimized the parameter boundaries for the 4090 card with 24G video memory, and has made detailed comments. The code content is in the previous article: https://mp.weixin.qq.com/s/olblI2gE3HHDSEGnejGBrw. Friends who need it can use it by themselves. In order to quickly run the test, this article limits the parameters such as max_steps. The following is the log analysis corresponding to each stage of the training code execution:

Training log

Loading the model

? Unsloth: Will patch your computer  to  enable  2 x faster  free  finetuning.? Unsloth Zoo will now patch everything  to  make training faster !INFO  05-25  10 : 56 : 06  [importing.py: 53 ] Triton  module  has been replaced  with  a placeholder.INFO  05-25  10 : 56 : 06  [__init__.py: 239 ] Automatically detected platform cuda.=====  step1. Load the model  ============================================================================ (( ==== )) ==   Unsloth  2025.5.6 : Fast Qwen3 patching. Transformers:  4.51.3 . vLLM:  0.8.5 .post1.   \\    /|     NVIDIA GeForce RTX  4090.  Num GPUs  =  1.  Max memory:  23.65  GB. Platform: Linux.O ^ O /  \_ /  \ Torch:  2.6.0 + cu124. CUDA:  8.9 . CUDA Toolkit:  12.4 . Triton:  3.2.0\         /     Bfloat16  =  TRUE. FA [Xformers  =  0.0.29 .post2. FA2  =  False ] "-____-"      Free  license: http: // github.com / unslothai / unslothUnsloth: Fast downloading  is  enabled  -  ignore downloading bars which  are  red colored !Unsloth: vLLM loading  / models / Qwen / Qwen3 -4 B - Base  with  actual GPU utilization  =  68.76 %Unsloth: Your GPU has CUDA compute capability  8.9  with  VRAM  =  23.65  GB.Unsloth:  Using  conservativeness  =  1.0 . Chunked prefill tokens  =  2048.  Num Sequences  =  224.Unsloth: vLLM 's KV Cache can use up to 9.31 GB. Also swap space = 6 GB.INFO 05-25 10:59:17 [config.py:717] This model supports multiple tasks: {'embed ', ' generate ', ' score ', ' reward ', ' classify '}. Defaulting to ' generate '.INFO 05-25 10:59:17 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.INFO 05-25 10:59:17 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: . . .WARNING 05-25 10:59:17 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f19973bbd50>INFO 05-25 10:59:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0INFO 05-25 10:59:28 [cuda.py:221] Using Flash Attention backend on V1 engine.WARNING 05-25 10:59:28 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.INFO 05-25 10:59:28 [gpu_model_runner.py:1329] Starting to load model /models/Qwen/Qwen3-4B-Base...Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.20s/it]Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.60s/it]Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.52s/it]
INFO 05-25 10:59:33 [loader.py:458] Loading weights took 4.81 secondsINFO 05-25 10:59:33 [punica_selector.py:18] Using PunicaWrapperGPU.INFO 05-25 10:59:33 [gpu_model_runner.py:1347] Model loading took 7.6334 GiB and 5.084545 secondsINFO 05-25 10:59:48 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/f7b249c75c/rank_0_0 for vLLM's torch.compileINFO  05-25  10:59:48 [  backends.py: 430 ] Dynamo bytecode   transform time :  14.75 sInductor Compilation:  100 %| ██████████ |  6 / 6  [ 00 : 01 < 00 : 00 ,   5.13 it / s, triton_poi_fused_add_mul_sub_5]INFO  05-25  10 : 59 : 53  [backends.py: 136 ] Cache the graph  of  shape  None  for  later use. . . Inductor Compilation  : 100 % |  5 / 5 INFO  05-25  11 : 00 : 39  [backends.py: 148 ] Compiling a graph  for  general shape takes  49.26  sINFO  05-25  11 : 03 : 14  [monitor.py: 33 ] torch.compile takes  64.01  s  in  totalINFO  05-25  11:03:18 [  kv_cache_utils.py: 634 ] GPU KV cache  size : 49 , 856  tokensINFO  05-25  11 : 03 : 18  [kv_cache_utils.py: 637 ] Maximum concurrency  for  2 , 048  tokens  per  request:  24.34 xINFO  05-25  11 : 04 : 19  [gpu_model_runner.py: 1686 ] Graph capturing finished  in  61  secs, took  3.94  GiBINFO  05-25  11 : 04 : 19  [core.py: 159 ] init engine (profile,  create  kv cache, warmup model) took  286.49  secondsUnsloth  2025.5.6  patched  36  layers  with  36  QKV layers,  36  O layers  and  36  MLP layers.

Model structure

For the loaded model, the structure after inserting LORA:

/models/Qwen/Qwen3-4B-Base does not have a padding token! Will use pad_token = <|vision_pad|>.model:PeftModelForCausalLM( (base_model): LoraModel( (model): Qwen3ForCausalLM( (model): Qwen3Model( (embed_tokens): Embedding(151936, 2560, padding_idx=151654) (layers): ModuleList( (0-35): 36 x Qwen3DecoderLayer( (self_attn): Qwen3Attention( (q_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=4096, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=4096, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (k_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=1024, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (v_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=1024, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (o_proj): lora.Linear( (base_layer): Linear(in_features=4096, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=4096, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B):ParameterDict() (lora_magnitude_vector): ModuleDict() ) (q_norm): Qwen3RMSNorm((128,), eps=1e-06) (k_norm): Qwen3RMSNorm((128,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (mlp): Qwen3MLP( (gate_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))

GRPO Dialogue Template

===== step2. Prepare GRPO dialogue template ================================================================ Dialog template output example: You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>What is 2+2?<start_working_out>

The format follows pre-fine-tuning

Before GRPO training, we first use the conversation data with the reasoning process to perform a simple SFT training on the original model to enable the model to have the ability to output in our reasoning format.

First, we need to clean the original data set and filter out the data suitable for format fine-tuning:

===== step3. The format follows the pre-fine-tuning ===================================================================----- Cleaned data set:      expected_answer ... generated_solution0 14 ...   < think > \nOkay, let's see. I need to solve the ...6 -2 ...   < think > \nOkay, so I need to find the value of ...9 18 ...   < think > \nOkay, so I need to solve the equation...13 2 ...   < think > \nOkay, so I need to evaluate the infin...17 30 ...   < think > \nAlright, so I need to find the larges...... ... ... ...19243 244 ...   < think > \nOkay, so I need to find the value of ...19245 1 ...   < think > \nOkay, so I have this problem where a ...19247 4 ...   < think > \nOkay, let's tackle this problem step ...19248 18 ...   < think > \nOkay, let's see. I need to find the n...19250 0.8960 ...   < think > \nOkay, so I need to find the probabili...
[7507 rows x 3 columns]

----- Output of the first   OpenMathReasoning data formatted  using the dialog template:You  are given a problem.Think  about the problem and provide your working out.Place  it between  < start_working_out >  and  < end_working_out > .Then , provide your solution between  < SOLUTION >  and  </ SOLUTION ><| endoftext |> Given  $\sqrt{x ^ 2 + 165 } - \sqrt{x ^ 2 - 52 } = 7 $ and  $x $  is  positive, find all possible values of  $x $ .< start_working_out > Okay ,  let 's see.  I  need to solve the equation  √ (x²  +  165 )  -  √ (x²  -  52 )  =  7 , and find all positive values of x.  Hmm , radicals can be tricky, but maybe  if  I  can eliminate the square roots by squaring both sides.  Let  me  try  that.
First ,  let  me write down the equation again to make sure  I  have it right:
√ (x²  +  165 )  -  √ (x²  -  52 )  =  7 .
Okay , so the idea  is  to isolate one of the radicals and then square both sides.  Let  me  try  moving the second radical to the other side:
√ (x²  +  165 )  =  7  +  √ (x²  -  52 ).
Now ,  if  I  square both sides, maybe  I  can  get  rid of the square roots.  Let 's  do  that:
( √ (x²  +  165 ))²  =  ( 7  +  √ (x²  -  52 ))².
Simplifying  the left side:
x²  +  165  =  49  +  14 √ (x²  -  52 )  +  ( √ (x²  -  52 ))².
The  right side  is  expanded using the formula (a  +  b)²  =  a²  +  2ab  +  b².  So  the right side becomes  7 ²  +  2 * 7 *√ (x²  -  52 )  +  ( √ (x²  -  52 ))², which  is  49  +  14 √ (x²  -  52 )  +  (x²  -  52 ).
So  putting it all together:
x²  +  165  =  49  +  14 √ (x²  -  52 )  +  x²  -  52 .
Hmm ,  let 's simplify the right side.  The  x² terms will cancel out, right ?  Let 's subtract x² from both sides:
165  =  49  +  14 √ (x²  -  52 )  -  52 .
Simplify  the constants on the right:
49  -  52  is  - 3 , so:
165  =  - 3  +  14 √ (x²  -  52 ).
Now , add  3  to both sides to isolate the radical term:
165  +  3  =  14√ (x²  - 52 ) . 
So  168  =  14√ (x²  - 52 ) . 
Divide  both sides by  14 :
168  /  14  =  √ (x²  -  52 ).
12  =  √ (x²  -  52 ).
Now , square both sides again to eliminate the square root:
12² =  x²  -  52 . 
144  =  x²  -  52 .
Add  52  to both sides:
144  +  52  =  x².
196  =  x².
So  x  =  √ 196  =  14 .
But  wait, since the problem states that x  is  positive, we only take the positive root.  So  x  =  14 .
But  hold on, when dealing with squaring equations, sometimes extraneous solutions can come up.  I  should check  if  this solution actually satisfies the original equation.
Let 's plug x  =  14  back into the original equation:
√ ( 14 ²  +  165 )  -  √ ( 14 ²  -  52 )  =  ?
Calculate  each  term:
14² is  196 . 
So  first radical:  √ ( 196  +  165 )  =  √ 361  =  19 .
Second  radical:  √ ( 196  -  52 )  =  √ 144  =  12 .
So  19  -  12  =  7 , which  is  exactly the right - hand side.  So  yes, it checks out.
Therefore ,  the only solution  is  x  =  14. Since  the problem says x  is  positive, we don't have to consider negative roots.  So  I  think that's the answer.To  solve the equation \(\sqrt{x ^ 2  +  165 }  -  \sqrt{x ^ 2  -  52 }  =  7 \)  for  positive \(x\), we proceed  as  follows:
1. Start  with  the given equation:   \[   \sqrt{x ^ 2  +  165 }  -  \sqrt{x ^ 2  -  52 }  =  7   \]
2. Isolate  one of the square roots by moving \(\sqrt{x ^ 2  -  52 }\) to the right side :    \[   \sqrt{x ^ 2  +  165 }  =  7  +  \sqrt{x ^ 2  -  52 }   \]
3. Square both sides to eliminate the square root   on the left:   \[   (\sqrt{x ^ 2  +  165 }) ^ 2  =  ( 7  +  \sqrt{x ^ 2  -  52 }) ^ 2   \]   Simplifying  both sides, we  get :   \[   x ^ 2  +  165  =  49  +  14 \sqrt{x ^ 2  -  52 }  +  (x ^ 2  -  52 )   \]
4. Combine  like  terms on the right side:   \[   x ^ 2  +  165  =  x ^ 2  -  52  +  49  +  14 \sqrt{x ^ 2  -  52 }   \]   Simplifying  further:   \[   x ^ 2  +  165  =  x ^ 2  -  3  +  14 \sqrt{x ^ 2  -  52 }   \]
5. Subtract  \(x ^ 2  \) from both sides :   \[   165  =  - 3  +  14 \sqrt{x ^ 2  -  52 }   \]
6. Add  3  to both sides to isolate the term with the square root :    \[   168  =  14 \sqrt{x ^ 2  -  52 }   \]
7. Divide   both sides  by 14 :   \[   12  =  \sqrt{x ^ 2  -  52 }   \]
8. Square  both sides again to eliminate the square root :    \[   12 ^ 2  =  x ^ 2  -  52   \]   Simplifying :   \[   144  =  x ^ 2  -  52   \]
9. Add  52 to both sides to   solve  for  \(x ^ 2 \):   \[   196  =  x ^ 2   \]
10. Take  the positive square root (since \(x\)  is  positive  ):    \[    x  =  \sqrt{ 196 }  =  14    \]
11. Verify  the solution by substituting \(x  =  14  \ ) back into the original equation:    \[    \sqrt{ 14 ^ 2  +  165 }  -  \sqrt{ 14 ^ 2  -  52 }  =  \sqrt{ 196  +  165 }  -  \sqrt{ 196  -  52 }  =  \sqrt{ 361 }  -  \sqrt{ 144 }  =  19  -  12  =  7    \]    The  solution checks out.
Thus , the only positive solution  is :\[\boxed{ 14 }\] < end_working_out >< SOLUTION > 14 </ SOLUTION ><| endoftext |> num_proc must be  <=  58 .  Reducing  num_proc to  58  for  dataset of size  58 .[ 2025-05-25 11:05:41 ] WARNING  arrow_dataset.py : 3010  :  num_proc must be < =  58. Reducing  num_proc  to 58 for dataset  of  size 58 .   
dataset.shape:( 58 ,  5 )

----- Processed pre-fine-tuning dataset: Dataset({ features: ['expected_answer', 'problem', 'generated_solution', 'Messages', 'N', 'text', '__index_level_0__'], num_rows: 58}) Unsloth: Tokenizing ["text"] (num_proc=58): 100%|██████████| 58/58 [00:07<00:00, 7.99 examples/s]

Then start pre-fine-tuning training:

==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 58 | Num Epochs = 2 | Total steps = 116O^O/ \_/ \ Batch size per device = 1 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)100%|██████████| 116/116 [00:47<00:00, 2.46it/s]Unsloth: Will smartly offload gradients to save VRAM!{'loss': 0.7447, 'grad_norm': 0.6478227376937866, 'learning_rate': 0.00016, 'epoch': 0.09}{'loss': 0.6066, 'grad_norm': 0.640754759311676, 'learning_rate': 0.00019279279279279282, 'epoch': 0.17}{'loss': 0.4543, 'grad_norm': 0.6311891674995422, 'learning_rate': 0.0001837837837837838, 'epoch': 0.26}{'loss': 0.4684, 'grad_norm': 0.5015860199928284, 'learning_rate': 0.00017477477477477476, 'epoch': 0.34}{'loss': 0.4063, 'grad_norm': 0.5008582472801208, 'learning_rate': 0.00016576576576576578, 'epoch': 0.43}{'loss': 0.3979, 'grad_norm': 0.5995965600013733, 'learning_rate': 0.00015675675675675676, 'epoch': 0.52}{'loss': 0.4248, 'grad_norm': 0.4734836518764496, 'learning_rate': 0.00014774774774774775, 'epoch': 0.6}{'loss': 0.4197, 'grad_norm': 0.5012277960777283, 'learning_rate': 0.00013873873873873876, 'epoch': 0.69}{'loss': 0.4511, 'grad_norm': 0.548245906829834, 'learning_rate': 0.00012972972972972974, 'epoch': 0.78}{'loss': 0.3974, 'grad_norm': 0.42141056060791016, 'learning_rate': 0.00012072072072072073, 'epoch': 0.86}{'loss': 0.3317, 'grad_norm': 0.4644368886947632, 'learning_rate': 0.0001117117117117117, 'epoch': 0.95}{'loss': 0.3846, 'grad_norm': 0.3927017152309418, 'learning_rate': 0.0001027027027027027, 'epoch': 1.03}{'loss': 0.2501, 'grad_norm': 0.5447007417678833, 'learning_rate': 9.36936936936937e-05, 'epoch': 1.12}{'loss': 0.278, 'grad_norm': 0.4823240339756012, 'learning_rate': 8.468468468468469e-05, 'epoch': 1.21}{'loss': 0.2645, 'grad_norm': 0.5164972543716431, 'learning_rate': 7.567567567567568e-05, 'epoch': 1.29}{'loss': 0.2584, 'grad_norm': 0.5759400725364685, 'learning_rate': 6.666666666666667e-05, 'epoch': 1.38}{'loss': 0.2121, 'grad_norm': 0.5618821978569031, 'learning_rate': 5.765765765765766e-05, 'epoch': 1.47}{'loss': 0.2322, 'grad_norm': 0.5534489154815674, 'learning_rate': 4.8648648648648654e-05, 'epoch': 1.55}{'loss': 0.2256, 'grad_norm': 0.6181885600090027,'learning_rate': 3.963963963963964e-05, 'epoch': 1.64}{'loss': 0.1841, 'grad_norm': 0.48197486996650696, 'learning_rate': 3.063063063063063e-05, 'epoch': 1.72}{'loss': 0.2789, 'grad_norm': 0.6069267988204956, 'learning_rate': 2.1621621621621624e-05, 'epoch': 1.81}{'loss': 0.2148, 'grad_norm': 0.5475031137466431, 'learning_rate': 1.2612612612612611e-05, 'epoch': 1.9}{'loss': 0.2263, 'grad_norm': 0.6717495918273926, 'learning_rate': 3.603603603603604e-06, 'epoch': 1.98}{'train_runtime': 47.1361, 'train_samples_per_second': 2.461, 'train_steps_per_second': 2.461, 'train_loss': 0.35218508824192246, 'epoch': 2.0}----- model.dtype of the trained model: torch.bfloat16

-----  Format follows the pre-fine-tuning training to complete, check whether the model has learned to follow our custom format  -----You  are given a problem.Think  about the problem and provide your working out.Place  it between  < start_working_out >  and  < end_working_out > .Then , provide your solution between  < SOLUTION >  and  </ SOLUTION ><| endoftext | > Jenifer has  82  cents  in  pennies and nickels  .  Her younger brother mistook  all   her nickels for dimes  and  counted the total as   $  1 . 47 .  |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad | ><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad | > <| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |>< | vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |> ol  let 's start by setting up the problem.  Let 's denote the number of pennies  as  \( p \) and the number of nickels  as  \( n \).  We  know that the total value of the coins  is 82  cents, so we can write the equation: \[ p  +  5n  =  82  \]
Next , we need to consider the mistake made by  Jenifer 's younger brother.  He  mistook all the nickels  for  dimes, so he counted the total  as  $1 . 47 , which  is  147  cents.  This  gives us another equation:\[ p  +  10n  =  147  \]
Now , we have a system of two equations:1. \(p  +  5n  =  82  \)2. \(p  +  10n  =  147  \)
To  solve this system, we can subtract the first equation from the second to eliminate \( p \):\[ (p  +  10n)  -  (p  +  5n)  =  147  -  82  \]\[ 5n  =  65  \]\[ n  =  13  \]
Now  that we have \( n  =  13  \), we can substitute this value back into the first equation to find \( p \):\[ p  +  5 ( 13 )  =  82  \]\[ p  +  65  =  82  \]\[ p  =  17  \]
So ,  Jenifer  has  17  pennies.  Let 's verify the solution:-  The  value of  17  pennies  is  \(  17  \times  1  =  17  \) cents.-  The  value of  13  nickels  is  \(  13  \times  5  =  65  \) cents.-  The  total value  is  \(  17  +  65  =  82  \) cents, which matches the given total.
Thus , the number of pennies  Jenifer  has  is  \(\boxed{ 17 }\) .< end_working_out >< SOLUTION > 17 </ SOLUTION ><| endoftext |>

As you can see, the model output after pre-fine-tuning meets expectations. The reasoning process is placed between the specified tags <start_working_out> and <end_working_out>, and the final answer is placed between the specified tags. (<start_working_out> will be added to the dialogue question to guide the model to output reasoning. Therefore, the model does not output this start tag, but directly outputs the thinking content, and then ends the thinking with <end_working_out>).

Processing the GRPO dataset

===== step4. Load and process the dataset =============================================================----- Dataset DAPO-Math-17k-Processed:Dataset({ features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'], num_rows: 14116})----- The first prompt: In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.----- Solution of the first item: 34Map: 100%|██████████| 14116/14116 [00:01<00:00, 11664.38 examples/s]----- Content of the 1st conversation format: {'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION> and </SOLUTION>', 'role': 'system'}, {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.', 'role': 'user'}], 'solution': '34', 'data_source': 'math_dapo', 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.\n\nRemember to put your answer on its own line after "Answer:".', 'role': 'user'}], 'ability': 'MATH', 'reward_model': {'ground_truth': '34', 'style': 'rule-lighteval/MATH_v2'}, 'extra_info': {'index': '9a9b6eb4-a1cb-49d1-8c1e-62eaf2f74079'}, 'answer': '34'}Map: 100%|██████████| 14116/14116 [00:04<00:00, 3005.13 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then,provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$. [00:02<00:00, 5114.02 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>Map: 100%|██████████| 14116/14116 [00:02<00:00, 5114.02 examples/s]==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 12,709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)Max Length = 203709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)Max Length = 203

Define and test the reward function

Four reward functions are defined in the training script. The following are scoring examples for two of them:

===== step5. Define and test reward function ==================================================================match_format:re.compile(' < end_working_out > .*? < SOLUTION > (.+?) </ SOLUTION > [\\s]{0,}(?:<\\|endoftext\\|>)?[\\s]{0,}$', re.MULTILINE|re.DOTALL)
----- Reward function check_answer scoring example:
Case | Question | Response | Answer | Extracted | Score------------------------------------------------------------------------------------------1 | Q: 2+2 = ? | Let me think! < end_working_out >< SOLUTION > 4 </ SOLUTION >  | 4 | 4 | 5.02 | Q: Hello? |  < start_working_out > think.. < end_working_out >< SOLUTION >   yes   </ SOLUTION >  | yes | yes | 3.53 | Q: Value? | ! < end_working_out >< SOLUTION > 9.5 </ SOLUTION >  | 10 | 9.5 | 2.04 | Q: Value? | ! < end_working_out >< SOLUTION > 8.3 </ SOLUTION >  | 10 | 8.3 | 1.55 | Q: Value? | i! < end_working_out >< SOLUTION > 5 </ SOLUTION >  | 10 | 5 | -2.56 | Q: Answer? | i! < end_working_out >< SOLUTION > no digit </ SOLUTION >  | 42 | no digit | -4.57 | Q: String? | f! < end_working_out >< SOLUTION > oobar </ SOLUTION >  | baz | oobar | -4.5
----- Reward function check_numbers scoring example:
Case | Question | Response(s) | Answer(s) | Score(s)--------------------------------------------------------------------------------------1 | Q: 2+2=? |  < SOLUTION > 4 | 4 | 3.52 | Q: Total amount? |  < SOLUTION >  1,234.00 | 1234.0 | 3.53 | Q: Output |  < SOLUTION > No number | 0 | -2.54 | Q: 10-3=? |  < SOLUTION > 5 | 7 | -1.55 | Q: 1+1=?,2+2=? |  < SOLUTION > 2,4 | 2,4 | 3.5,-2.5

Conduct GRPO training

===== step6. Training model========================================================================. . .  5%|▌ | 5/100 [02:19<40:11, 25.38s/it]************************Question:. . .100%|██████████| 100/100 [46:13<00:00, 27.74s/it]can fire at a single
Extracted:None{ 'loss' : 0.0067,  'grad_norm' : 0.2612026631832123,  'learning_rate' : 2.777777777777776e-07,  'rewards/match_format_exactly' : 2.25,  'rewards/match_format_approximately' : 0.375,  'rewards/check_answer' : 3.25,  'rewards/check_numbers' : 2.0,  'reward' : 7.875,  'reward_std' : 10.25,  'completion_length' : 1379.25,  'kl' : 0.1681276112794876,  'epoch' : 0.01}{ 'loss' : 0.0052,  'grad_norm' : 0.22037602961063385,  'learning_rate' : 2.222222222222224e-07,  'rewards/match_format_exactly' : 1.5,  'rewards/match_format_approximately' : -0.75,  'rewards/check_answer' : 1.5,  'rewards/check_numbers' : 0.5,  'reward' : 2.75,  'reward_std' : 11.83568000793457,  'completion_length' : 1665.75,  'kl' : 0.1294582337141037,  'epoch' : 0.01}{ 'loss' : 0.0043,  'grad_norm' : 0.18991202116012573,  'learning_rate' : 1.66666666666668e-07,  'rewards/match_format_exactly' : 0.75,  'rewards/match_format_approximately' : -1.875,  'rewards/check_answer' : -0.25,  'rewards/check_numbers' : -1.0,  'reward' : -2.375,  'reward_std' : 10.25,  'completion_length' : 1775.5,  'kl' : 0.1086689755320549,  'epoch' : 0.01}{ 'loss' : 0.0046,  'grad_norm' : 0.025854697450995445,  'learning_rate' : 1.111111111111112e-07,  'rewards/match_format_exactly' : 0.0,  'rewards/match_format_approximately' : -3.0,  'rewards/check_answer' : -2.0,  'rewards/check_numbers' : -2.5,  'reward' : -7.5,  'reward_std' : 0.0,  'completion_length' : 1844.0,  'kl' : 0.11534835398197174,  'epoch' : 0.01}{ 'loss' : 0.0052,  'grad_norm' : 0.05398529767990112,  'learning_rate' : 5.55555555555556e-08,  'rewards/match_format_exactly' : 0.0,  'rewards/match_format_approximately' : -3.0,  'rewards/check_answer' : -2.0,  'rewards/check_numbers' : -2.5,  'reward' : -7.5,  'reward_std' : 0.0,  'completion_length' : 1844.0,  'kl' : 0.1298024207353592,  'epoch' : 0.01}{ 'train_runtime' : 2773.8914,  'train_samples_per_second' : 0.144,  'train_steps_per_second' : 0.036,  'train_loss' : 0.005772246685810387,  'epoch' : 0.01}

Here we set max_steps = 100 to complete the test as quickly as possible. During formal training, you can control the number of training rounds by setting epochs and end the training early based on the loss convergence.

Testing the trained model

Processed prompts: 100%|██████████| 1/1 [00:11<00:00, 11.75s/it, est. speed input: 0.85 toks/s, output: 87.15 toks/s]----- Answers from the basic model: - AnswersMath and ArithmeticWhat is the sqrt of 101?Wiki User∙ 2010-05-29 22:38:13Best AnswerCopyThe square root of 101 is 10.0498756 approximatelyWiki User∙ 2010-05-29 22:38:13This answer is:?0?0?0What is the square root of 101?It is approx. root of -101?The square root of -101 can be written as the product of the positive square root of 101 and i (where i is an imaginary number). The square root of 101 is approximately 10.04987751.What is the square root of 101 simplified?sqrt(101) is already simplified since 101 is not a perfect square. Also, we cannot simplify it since 101 is a prime number. (In other words, 101 = 1 x 101, so its only factorization is 1 and 101) In decimal form it is: 10.049875621120891586572919348985505109596599416484785647300807046What is the square root of 20200?sqrt (20200) = sqrt (4 x 100 x 505) = What is the square root of 3 sqrt (4) x sqrt (100) x sqrt (505) = 2 x 10 x sqrt (505) = 20 x sqrt (505) = 10 x sqrt (4) in 101?sqrt(3 in 101) = sqrt(101) x sqrt(3) = sqrt(101 x 3) = sqrt(303) = 17.4069...approx.What is an irrational number?-101 as a number.Why is one sixth of one third the same as one square root of one hundred sixty nine?sqrt(169) = 13 1/6 of 1/3 = (1/6)(1/3) = 1/(6x3) = 1/18 = (1/13)(1/13) = 1/sqrt(169)What number when squared equals six?If you meant sqrt(6)2 then this = 6 and sqrt(6) = 2.4494... For the number to be a square root you need the 6 to be in the denominator or the square root of 6.What is -20 sqrt of 101?-20 square root of (101) - 20 * sqrt(101) - 20 * sqrt(101) is a real number and cannot be simplified any further.What is the square root of 101.8?9.046921979920897...What is the square root of 7561?As sqrt(7561) = sqrt(169)*sqrt(41) = 13*sqrt(41) ~= 274.84779...What is the sqrt of 0.25?The sqrt of 0.25 is 0.5What is the square root of 169 over 10?It is 1.3How do you find the square root of 51?The square root of 51 is approx. 7.1414 The easiest way to do that is to use a calculator. If you do not have a calculator, I strongly suggest using one, since the sqrt of 51 is an irrational number with an infinite amount of decimal places. Square and cube roots can be calculated the old-fashioned manner by using trial and error. 72 = 49 which is too small; 82 = 64 which is too big, etc. If you need to go that route, you need to know your basicProcessed prompts: 100%|██████████| 1/1 [00:26<00:00, 26.65s/it, est. speed input: 2.25 toks/s, output: 74.61 toks/s]

----- GRPO-LoRA model's answer:Okay, so I need to find the square root of  101.  Hmm,  let  me think. The square root of a number  is  the  value  that,  when  multiplied  by  itself, gives the original number. But  101  seems like it 's not a perfect square, right? I remember that perfect squares like 100, 121  ,  144, etc., are  numbers  that have exact square roots since they' re squares of integers.  100 , which  is 10  squared. So maybe √ 101 is  close to  10  but  not  exactly  10.  Let me calculate  10  squared first.  10  ×  10 is 100.  So √ 101 is  a little more than  10.  How much more? Well,  10.5  squared  is 110.25 , which  is  higher than  101.  So it has to be between  10 and 10.5 . Maybe  10.05 ? Let  me try that  .  10.05  squared  is 10.05  ×  10.05 .  Let  me  compute  that .    100  +  0.5  +  0.5  +  0.0025  =  101.0025 . That 's very close to 101, so √101 is approximately 10.05. But wait, 10.04 squared might be slightly less. Let me check 10.04: 10.04 × 10.04. This is 10.04 × 10 = 100.4, then 0.04 × 10 = 0.4, and 0.04×0.04 = 0.0016. So total: 100.4 + 0.4 = 100.8, plus 0.0016, which is 100.8016. That's less than  101.  So √ 101 is  between  10.04 and 10.05 . Maybe  10.045 ? Let me  try 10.045 :  10.045  squared. Hmm,  10 × 10 = 100 ,  10 × 0.045                       So it  ' s                closer to 10.04. Maybe 10.042? Let me try 10.042: 10.042 squared. 10×10=100, 10×0.042=0.42, 0.042×10=0.42, 0.042×0.042=0.001764. Adding: 100 + 0.42 + 0.42 = 100.84, plus 0.001764 is 100.841764, which is still below 101. So it's closer to  10.043 . Let me  try  10.043 :  10.043  squared.  10 × 10 = 100 ,  10 × 0.043 = 0.43 ,  0.043 × 10 = 0.43 ,  0.043 × 0.043 = 0.001849 . Adding:  100  +  0.43  +  0.43  =  100.86 , plus  0.001849  is  100.861849 , still less than  101.  So √ 101  is  approximately  10.043 . But I wonder  if  there 's a better way. Maybe using the Newton-Raphson method for square roots? Let me recall that method. Let x be the number we want to find the square root of. Start with an initial guess, say 10. Then the method is x_(new) = (x + N/x)/2, where N is the number, which is 101. So first guess is 10. 20.1/2 = 10.05. Next iteration: (10.05 + 101/10.05)/2. 101 divided by 10.05 is approximately 10.049751. So 10.05 + 10.049751 = 20.099751, divided by 2 is 10.049875. the approximation is approximately 10.049875. That's more accurate than my previous guesses. So the square root of  101  is  approximately  10.049875 . But the problem didn 't specify how precise the answer needs to be, so maybe just the decimal approximation is acceptable. So let me express that in decimal form. 10.049875... Hmm, four decimal places would be 10.0499. But to be precise, I should keep more. So maybe 10.049875? But that' sa bit too far. Let me verify  with  the Newton-Raphson method again. Start  with  10.  ( 10  +  101/ 10 )/ 2  =  10.05 . Then ( 10.05  +  101 / 10.05 ) / 2.  101 / 10.05  is  10.04975 . So  10.05  +  10.04975  =  20.09975 . Divided  by  2  is  10.049875 . Then next iteration: ( 10.049875  +  101 / 10.049875 )/ 2.  101  divided  by  10.049875  is  approximately  10.049874 . So  10.049875  +  10.049874  =  20.099749 , divided  by  2  is  10.0498745 . So after three iterations, it 's approximately 10.0498745. So the square root of 101 is approximately 10.049875. Therefore, I think the answer is around 10.05. But for a better approximation, maybe using more iterations or a calculator. But since the problem doesn' t specify, I 'll go with 10.05. Let me check if 10.05 squared is 101. 10.05 × 10.05. Let me multiply that out. 10×10=100, 10×0.05=0.5, 0.05×10=0.5, 0.05×0.05=0.0025. Adding: 100 + 0.5 + 0.5 = 101, plus 0.0025 is 101.0025. So √101 is slightly less than 10.05. Therefore, the square root is approximately 10.049875. So in decimal form, that's about  10.050 . Therefore, the square root of  101  is  approximately  10.05 .To find the square root of  101 , we can use the Newton-Raphson method  for  approximation. The method starts  with  an initial guess  and  iteratively refines it.
1.  **Initial Guess**: Start  with  \( x_0

The answer here is truncated due to length issues. This is because the test code in this article sets the maximum number of generated tokens, max_tokens = 1024, during inference; these issues have been resolved in the latest training code. See:

"One-click GRPO fine-tuning of the training code for the latest Qwen3 model on a single 4090 card"

Merge and save models

After Lora training is completed, the original model will not be changed, nor will a completely new model be generated. Instead, an additional smaller Lora weight is generated, which contains the trained content. In the above test, this Lora weight is used as a plug-in and loaded together with the original weight for inference testing. After the test is completed, we need to merge the plug-in lora weight with the original model to generate a new complete model.

===== step8. Merging and saving model ============================================================= Unsloth: Merging 4bit and LoRA weights to 16bit...Unsloth: Will use up to 336.32 out of 503.72 RAM for saving.Unsloth: Saving model... This might take 5 minutes... 17%|█▋ | 6/36 [00:00<00:00, 57.02it/s]We will save to Disk and not RAM now.100%|██████████| 36/36 [00:05<00:00, 6.11it/s]Unsloth: Saving tokenizer... Done.Done.

The obtained model can be further quantified as needed.