One-click GRPO fine-tuning Qwen3 latest model on single card 4090

Core content:
1. Steps to download the Qwen3 model and required datasets
2. Use Docker to start the container for model fine-tuning
3. Use the Unsloth Zoo to accelerate the fine-tuning process
Fine-tune the latest Qwen3 model with one click to improve model performance and accelerate the training process.
Model and dataset download
Because of the domestic network environment, I have a fear of connection timeout, so the first thing is to download what needs to be downloaded, and don't download dynamically during operation. The model and dataset address used in this article are:
-
https://modelscope.cn/models/Qwen/Qwen3-4B-Base
-
https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini
-
https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed
Download command:
modelscope download --model Qwen/Qwen3-4B-Base --revision master --local_dir /models/Qwen/Qwen3-4B-Basehuggingface-cli download --resume-download --repo-type dataset unsloth/OpenMathReasoning-mini --local-dir unsloth/OpenMathReasoning-minihuggingface-cli download --resume-download --repo-type dataset open-r1/DAPO-Math-17k-Processed --local-dir open-r1/DAPO-Math-17k-Processed
Start the container
# docker run --name unsloth0517 -itd --gpus '"device=4"' \
-v /data/ai/models:/models \
-v /data/ai/datasets:/datasets \
-v /data/ai/workspace/unsloth:/workspace \
unsloth:20250517_4cd5_cu121 bash
# docker exec -it unsloth0517 bash
root@ 1855d8235e1a:/home/unsloth # cd /workspace/scripts
The construction method of the Docker image unsloth:20250517_4cd5_cu121 is described in detail in the previous article: "Full steps for fine-tuning Qwen3-32B for a single RTX4090 card".
root@1855d8235e1a:/workspace/scripts# python unsloth-grpo-qwen3.py? Unsloth: Will patch your computer to enable 2x faster free finetuning.? Unsloth Zoo will now patch everything to make training faster! Traceback (most recent call last): File "/workspace/scripts/unsloth-grpo-qwen3.py", line 17, in <module> model, tokenizer = FastLanguageModel.from_pretrained( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/unsloth/models/loader.py", line 138, in from_pretrained raise ImportError(ImportError: Unsloth: Please install vLLM before enabling `fast_inference`! You can do this in a terminal via `pip install vllm`
Because the unsloth:20250517_4cd5_cu121 container image does not contain vllm, this error is reported.
We can create another image containing VLLM based on the unsloth:20250517_4cd5_cu121 image. For simplicity, we will install VLLM directly in the container:
root@1855d8235e1a: /workspace/s scripts # export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/
root@1855d8235e1a: /workspace/s scripts # pip install vllm
. . .
Successfully installed airportsdata- 20250224 annotated-types- 0 . 7.0 anyio- 4.9 . 0 astor- 0 . 8.1 blake3- 1.0 . 5 cachetools- 5.5 . 2 cloudpickle- 3.1 . 1 compressed-tensors- 0 . 9.3 cupy-cuda12x- 13.4 . 1 deprecated- 1.2 . 18 depyf- 0 . 18.0 diskcache- 5.6 . 3 einops- 0 . 8.1 email-validator- 2.2 . 0 fastapi- 0 . 115.12 fastapi-cli- 0 . 0 . 7 fastrlock- 0 . 8.3 gguf- 0.16.3 googleapis-common-protos- 1.70 . 0 h11- 0 . 16.0 hf-xet- 1.1 . 2 httpcore- 1.0 . 9 httptools- 0 . 6.4 httpx- 0 . 28.1 importlib_metadata- 8.0 . 0 interegular- 0 . 3.3 jinja2- 3.1 . 6 jiter- 0 . 10.0 lark- 1.2 . 2 llguidance- 0 . 7.22 llvmlite- 0 . 44.0 lm- format -enforcer- 0 . 10.11 mistral_common- 1.5 . 5 msgpack- 1.1 . 0 nest_asyncio- 1.6 . 0 numba- 0 . 61.2 nvidia-cublas-cu12- 12.4 . 5.8 nvidia-cuda-cupti-cu12- 12.4 . 127 nvidia-cuda-nvrtc-cu12- 12.4 . 127 nvidia-cuda-runtime-cu12- 12.4 . 127 nvidia-cufft-cu12- 11.2 . 1.3 nvidia-curand-cu12- 10.3. 5.147 nvidia-cusolver-cu12- 11.6 . 1.9 nvidia-cusparse-cu12- 12.3 . 1.170 nvidia-cusparselt-cu12- 0 . 6.2 nvidia-nvjitlink-cu12- 12.4 . 127 nvidia-nvtx-cu12- 12.4 . 127 openai- 1.81 . 0 opencv-python-headless- 4.11 . 0 . 86 opentelemetry-api- 1.26 . 0 opentelemetry-exporter-otlp- 1.26 . 0 opentelemetry-exporter-otlp-proto-common- 1.26 . 0 opentelemetry-exporter-otlp-proto-grpc- 1.26 . 0 opentelemetry-exporter-otlp-proto-http- 1.26 . 0 opentelemetry-proto- 1.26 . 0 opentelemetry-sdk- 1.26 . 0 opentelemetry-semantic-conventions- 0 . 47 b 0 opentelemetry-semantic-conventions-ai- 0 . 4.9 outlines- 0 . 1.11 outlines_core- 0 . 1.26 partial-json-parser- 0 . 2.1 . 1 .post5 pillow- 11.2 . 1 prometheus-fastapi-instrumentator- 7.1 . 0 prometheus_client- 0 . 22.0 py-cpuinfo- 9.0 . 0 pycountry- 24.6 . 1 pydantic- 2.11 . 4 pydantic-core- 2.33 . 2 python-dotenv- 1.1 . 0 python-json-logger- 3.3 . 0 python-multipart- 0 . 0 . 20 pyzmq- 26.4 . 0 ray- 2.46 . 0 rich-toolkit- 0 . 14.6 scipy- 1.15 . 3 shellingham- 1.5 . 4 sniffio- 1.3 . 1 starlette- 0 . 46.2 tiktoken- 0 .9.0 torch- 2.6 . 0 torchaudio- 2.6 . 0 torchvision- 0 . 21.0 triton- 3.2 . 0 typer- 0 . 15.4 typing-inspection- 0 . 4.1 uvicorn- 0 . 34.2 uvloop- 0 . 21.0 vllm- 0 . 8.5 .post1 watchfiles- 1.0 . 5 websockets - 15.0 . 1 wrapt - 1.17 . 2 xformers - 0 . 0 . 29 .post2 xgrammar - 0 . 1.18
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behavior with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https: //pip .pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Start training
After entering the container, make sure that the following directories are visible in the container:
-
Base model directory to be trained: /models/Qwen/Qwen3-4B-Base
-
Dataset 1: /datasets/unsloth/OpenMathReasoning-mini/data/cot-00000-of-00001.parquet
-
Dataset 2:/datasets/open-r1/DAPO-Math-17k-Processed/en/train-00000-of-00001.parquet
-
Working directory and training code file: /workspace/scripts/unsloth-grpo-qwen3.py
Execute the following code in the /workspace/scripts/ directory of the container to start training:
cat unsloth-grpo-qwen3.py > unsloth-grpo-qwen3.py.log && \ nohup python unsloth-grpo-qwen3.py >> unsloth-grpo-qwen3.py.log 2>&1 &
We intentionally flush the training code to the front of the training log to fix it. The advantage of doing this is that it is convenient for troubleshooting during code iteration.
The latest version of the training code unsloth-grpo-qwen3.py has optimized the parameter boundaries for the 4090 card with 24G video memory, and has made detailed comments. The code content is in the previous article: https://mp.weixin.qq.com/s/olblI2gE3HHDSEGnejGBrw. Friends who need it can use it by themselves. In order to quickly run the test, this article limits the parameters such as max_steps. The following is the log analysis corresponding to each stage of the training code execution:
Training log
Loading the model
? Unsloth: Will patch your computer to enable 2 x faster free finetuning.
? Unsloth Zoo will now patch everything to make training faster !
INFO 05-25 10 : 56 : 06 [importing.py: 53 ] Triton module has been replaced with a placeholder.
INFO 05-25 10 : 56 : 06 [__init__.py: 239 ] Automatically detected platform cuda.
===== step1. Load the model ==========================================================================
== (( ==== )) == Unsloth 2025.5.6 : Fast Qwen3 patching. Transformers: 4.51.3 . vLLM: 0.8.5 .post1.
\\ /| NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.65 GB. Platform: Linux.
O ^ O / \_ / \ Torch: 2.6.0 + cu124. CUDA: 8.9 . CUDA Toolkit: 12.4 . Triton: 3.2.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29 .post2. FA2 = False ]
"-____-" Free license: http: // github.com / unslothai / unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored !
Unsloth: vLLM loading / models / Qwen / Qwen3 -4 B - Base with actual GPU utilization = 68.76 %
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 23.65 GB.
Unsloth: Using conservativeness = 1.0 . Chunked prefill tokens = 2048. Num Sequences = 224.
Unsloth: vLLM 's KV Cache can use up to 9.31 GB. Also swap space = 6 GB.
INFO 05-25 10:59:17 [config.py:717] This model supports multiple tasks: {'embed ', ' generate ', ' score ', ' reward ', ' classify '}. Defaulting to ' generate '.
INFO 05-25 10:59:17 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-25 10:59:17 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: . . .
WARNING 05-25 10:59:17 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f19973bbd50>
INFO 05-25 10:59:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-25 10:59:28 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-25 10:59:28 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-25 10:59:28 [gpu_model_runner.py:1329] Starting to load model /models/Qwen/Qwen3-4B-Base...
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.52s/it]
INFO 05-25 10:59:33 [loader.py:458] Loading weights took 4.81 seconds
INFO 05-25 10:59:33 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-25 10:59:33 [gpu_model_runner.py:1347] Model loading took 7.6334 GiB and 5.084545 seconds
INFO 05-25 10:59:48 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/f7b249c75c/rank_0_0 for vLLM's torch.compile
INFO 05-25 10:59:48 [ backends.py: 430 ] Dynamo bytecode transform time : 14.75 s
Inductor Compilation: 100 %| ██████████ | 6 / 6 [ 00 : 01 < 00 : 00 , 5.13 it / s, triton_poi_fused_add_mul_sub_5]
INFO 05-25 10 : 59 : 53 [backends.py: 136 ] Cache the graph of shape None for later use
. . .
Inductor Compilation : 100 % | 5 / 5
INFO 05-25 11 : 00 : 39 [backends.py: 148 ] Compiling a graph for general shape takes 49.26 s
INFO 05-25 11 : 03 : 14 [monitor.py: 33 ] torch.compile takes 64.01 s in total
INFO 05-25 11:03:18 [ kv_cache_utils.py: 634 ] GPU KV cache size : 49 , 856 tokens
INFO 05-25 11 : 03 : 18 [kv_cache_utils.py: 637 ] Maximum concurrency for 2 , 048 tokens per request: 24.34 x
INFO 05-25 11 : 04 : 19 [gpu_model_runner.py: 1686 ] Graph capturing finished in 61 secs, took 3.94 GiB
INFO 05-25 11 : 04 : 19 [core.py: 159 ] init engine (profile, create kv cache, warmup model) took 286.49 seconds
Unsloth 2025.5.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
Model structure
For the loaded model, the structure after inserting LORA:
/models/Qwen/Qwen3-4B-Base does not have a padding token! Will use pad_token = <|vision_pad|>.model:PeftModelForCausalLM( (base_model): LoraModel( (model): Qwen3ForCausalLM( (model): Qwen3Model( (embed_tokens): Embedding(151936, 2560, padding_idx=151654) (layers): ModuleList( (0-35): 36 x Qwen3DecoderLayer( (self_attn): Qwen3Attention( (q_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=4096, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=4096, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (k_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=1024, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (v_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=1024, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (o_proj): lora.Linear( (base_layer): Linear(in_features=4096, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=4096, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B):ParameterDict() (lora_magnitude_vector): ModuleDict() ) (q_norm): Qwen3RMSNorm((128,), eps=1e-06) (k_norm): Qwen3RMSNorm((128,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (mlp): Qwen3MLP( (gate_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))
GRPO Dialogue Template
===== step2. Prepare GRPO dialogue template ================================================================ Dialog template output example: You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>What is 2+2?<start_working_out>
The format follows pre-fine-tuning
Before GRPO training, we first use the conversation data with the reasoning process to perform a simple SFT training on the original model to enable the model to have the ability to output in our reasoning format.
First, we need to clean the original data set and filter out the data suitable for format fine-tuning:
===== step3. The format follows the pre-fine-tuning ===================================================================
----- Cleaned data set:
expected_answer ... generated_solution
0 14 ... < think > \nOkay, let's see. I need to solve the ...
6 -2 ... < think > \nOkay, so I need to find the value of ...
9 18 ... < think > \nOkay, so I need to solve the equation...
13 2 ... < think > \nOkay, so I need to evaluate the infin...
17 30 ... < think > \nAlright, so I need to find the larges...
... ... ... ...
19243 244 ... < think > \nOkay, so I need to find the value of ...
19245 1 ... < think > \nOkay, so I have this problem where a ...
19247 4 ... < think > \nOkay, let's tackle this problem step ...
19248 18 ... < think > \nOkay, let's see. I need to find the n...
19250 0.8960 ... < think > \nOkay, so I need to find the probabili...
[7507 rows x 3 columns]
----- Output of the first OpenMathReasoning data formatted using the dialog template:
You are given a problem.
Think about the problem and provide your working out.
Place it between < start_working_out > and < end_working_out > .
Then , provide your solution between < SOLUTION > and </ SOLUTION ><| endoftext |> Given $\sqrt{x ^ 2 + 165 } - \sqrt{x ^ 2 - 52 } = 7 $ and $x $ is positive, find all possible values of $x $ .< start_working_out > Okay , let 's see. I need to solve the equation √ (x² + 165 ) - √ (x² - 52 ) = 7 , and find all positive values of x. Hmm , radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.
First , let me write down the equation again to make sure I have it right:
√ (x² + 165 ) - √ (x² - 52 ) = 7 .
Okay , so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:
√ (x² + 165 ) = 7 + √ (x² - 52 ).
Now , if I square both sides, maybe I can get rid of the square roots. Let 's do that:
( √ (x² + 165 ))² = ( 7 + √ (x² - 52 ))².
Simplifying the left side:
x² + 165 = 49 + 14 √ (x² - 52 ) + ( √ (x² - 52 ))².
The right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7 ² + 2 * 7 *√ (x² - 52 ) + ( √ (x² - 52 ))², which is 49 + 14 √ (x² - 52 ) + (x² - 52 ).
So putting it all together:
x² + 165 = 49 + 14 √ (x² - 52 ) + x² - 52 .
Hmm , let 's simplify the right side. The x² terms will cancel out, right ? Let 's subtract x² from both sides:
165 = 49 + 14 √ (x² - 52 ) - 52 .
Simplify the constants on the right:
49 - 52 is - 3 , so:
165 = - 3 + 14 √ (x² - 52 ).
Now , add 3 to both sides to isolate the radical term:
165 + 3 = 14√ (x² - 52 ) .
So 168 = 14√ (x² - 52 ) .
Divide both sides by 14 :
168 / 14 = √ (x² - 52 ).
12 = √ (x² - 52 ).
Now , square both sides again to eliminate the square root:
12² = x² - 52 .
144 = x² - 52 .
Add 52 to both sides:
144 + 52 = x².
196 = x².
So x = √ 196 = 14 .
But wait, since the problem states that x is positive, we only take the positive root. So x = 14 .
But hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.
Let 's plug x = 14 back into the original equation:
√ ( 14 ² + 165 ) - √ ( 14 ² - 52 ) = ?
Calculate each term:
14² is 196 .
So first radical: √ ( 196 + 165 ) = √ 361 = 19 .
Second radical: √ ( 196 - 52 ) = √ 144 = 12 .
So 19 - 12 = 7 , which is exactly the right - hand side. So yes, it checks out.
Therefore , the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.
To solve the equation \(\sqrt{x ^ 2 + 165 } - \sqrt{x ^ 2 - 52 } = 7 \) for positive \(x\), we proceed as follows:
1. Start with the given equation:
\[
\sqrt{x ^ 2 + 165 } - \sqrt{x ^ 2 - 52 } = 7
\]
2. Isolate one of the square roots by moving \(\sqrt{x ^ 2 - 52 }\) to the right side :
\[
\sqrt{x ^ 2 + 165 } = 7 + \sqrt{x ^ 2 - 52 }
\]
3. Square both sides to eliminate the square root on the left:
\[
(\sqrt{x ^ 2 + 165 }) ^ 2 = ( 7 + \sqrt{x ^ 2 - 52 }) ^ 2
\]
Simplifying both sides, we get :
\[
x ^ 2 + 165 = 49 + 14 \sqrt{x ^ 2 - 52 } + (x ^ 2 - 52 )
\]
4. Combine like terms on the right side:
\[
x ^ 2 + 165 = x ^ 2 - 52 + 49 + 14 \sqrt{x ^ 2 - 52 }
\]
Simplifying further:
\[
x ^ 2 + 165 = x ^ 2 - 3 + 14 \sqrt{x ^ 2 - 52 }
\]
5. Subtract \(x ^ 2 \) from both sides :
\[
165 = - 3 + 14 \sqrt{x ^ 2 - 52 }
\]
6. Add 3 to both sides to isolate the term with the square root :
\[
168 = 14 \sqrt{x ^ 2 - 52 }
\]
7. Divide both sides by 14 :
\[
12 = \sqrt{x ^ 2 - 52 }
\]
8. Square both sides again to eliminate the square root :
\[
12 ^ 2 = x ^ 2 - 52
\]
Simplifying :
\[
144 = x ^ 2 - 52
\]
9. Add 52 to both sides to solve for \(x ^ 2 \):
\[
196 = x ^ 2
\]
10. Take the positive square root (since \(x\) is positive ):
\[
x = \sqrt{ 196 } = 14
\]
11. Verify the solution by substituting \(x = 14 \ ) back into the original equation:
\[
\sqrt{ 14 ^ 2 + 165 } - \sqrt{ 14 ^ 2 - 52 } = \sqrt{ 196 + 165 } - \sqrt{ 196 - 52 } = \sqrt{ 361 } - \sqrt{ 144 } = 19 - 12 = 7
\]
The solution checks out.
Thus , the only positive solution is :
\[
\boxed{ 14 }
\] < end_working_out >< SOLUTION > 14 </ SOLUTION ><| endoftext |> num_proc must be <= 58 . Reducing num_proc to 58 for dataset of size 58 .
[ 2025-05-25 11:05:41 ] WARNING arrow_dataset.py : 3010 : num_proc must be < = 58. Reducing num_proc to 58 for dataset of size 58 .
dataset.shape:( 58 , 5 )
----- Processed pre-fine-tuning dataset: Dataset({ features: ['expected_answer', 'problem', 'generated_solution', 'Messages', 'N', 'text', '__index_level_0__'], num_rows: 58}) Unsloth: Tokenizing ["text"] (num_proc=58): 100%|██████████| 58/58 [00:07<00:00, 7.99 examples/s]
Then start pre-fine-tuning training:
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 58 | Num Epochs = 2 | Total steps = 116O^O/ \_/ \ Batch size per device = 1 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)100%|██████████| 116/116 [00:47<00:00, 2.46it/s]Unsloth: Will smartly offload gradients to save VRAM!{'loss': 0.7447, 'grad_norm': 0.6478227376937866, 'learning_rate': 0.00016, 'epoch': 0.09}{'loss': 0.6066, 'grad_norm': 0.640754759311676, 'learning_rate': 0.00019279279279279282, 'epoch': 0.17}{'loss': 0.4543, 'grad_norm': 0.6311891674995422, 'learning_rate': 0.0001837837837837838, 'epoch': 0.26}{'loss': 0.4684, 'grad_norm': 0.5015860199928284, 'learning_rate': 0.00017477477477477476, 'epoch': 0.34}{'loss': 0.4063, 'grad_norm': 0.5008582472801208, 'learning_rate': 0.00016576576576576578, 'epoch': 0.43}{'loss': 0.3979, 'grad_norm': 0.5995965600013733, 'learning_rate': 0.00015675675675675676, 'epoch': 0.52}{'loss': 0.4248, 'grad_norm': 0.4734836518764496, 'learning_rate': 0.00014774774774774775, 'epoch': 0.6}{'loss': 0.4197, 'grad_norm': 0.5012277960777283, 'learning_rate': 0.00013873873873873876, 'epoch': 0.69}{'loss': 0.4511, 'grad_norm': 0.548245906829834, 'learning_rate': 0.00012972972972972974, 'epoch': 0.78}{'loss': 0.3974, 'grad_norm': 0.42141056060791016, 'learning_rate': 0.00012072072072072073, 'epoch': 0.86}{'loss': 0.3317, 'grad_norm': 0.4644368886947632, 'learning_rate': 0.0001117117117117117, 'epoch': 0.95}{'loss': 0.3846, 'grad_norm': 0.3927017152309418, 'learning_rate': 0.0001027027027027027, 'epoch': 1.03}{'loss': 0.2501, 'grad_norm': 0.5447007417678833, 'learning_rate': 9.36936936936937e-05, 'epoch': 1.12}{'loss': 0.278, 'grad_norm': 0.4823240339756012, 'learning_rate': 8.468468468468469e-05, 'epoch': 1.21}{'loss': 0.2645, 'grad_norm': 0.5164972543716431, 'learning_rate': 7.567567567567568e-05, 'epoch': 1.29}{'loss': 0.2584, 'grad_norm': 0.5759400725364685, 'learning_rate': 6.666666666666667e-05, 'epoch': 1.38}{'loss': 0.2121, 'grad_norm': 0.5618821978569031, 'learning_rate': 5.765765765765766e-05, 'epoch': 1.47}{'loss': 0.2322, 'grad_norm': 0.5534489154815674, 'learning_rate': 4.8648648648648654e-05, 'epoch': 1.55}{'loss': 0.2256, 'grad_norm': 0.6181885600090027,'learning_rate': 3.963963963963964e-05, 'epoch': 1.64}{'loss': 0.1841, 'grad_norm': 0.48197486996650696, 'learning_rate': 3.063063063063063e-05, 'epoch': 1.72}{'loss': 0.2789, 'grad_norm': 0.6069267988204956, 'learning_rate': 2.1621621621621624e-05, 'epoch': 1.81}{'loss': 0.2148, 'grad_norm': 0.5475031137466431, 'learning_rate': 1.2612612612612611e-05, 'epoch': 1.9}{'loss': 0.2263, 'grad_norm': 0.6717495918273926, 'learning_rate': 3.603603603603604e-06, 'epoch': 1.98}{'train_runtime': 47.1361, 'train_samples_per_second': 2.461, 'train_steps_per_second': 2.461, 'train_loss': 0.35218508824192246, 'epoch': 2.0}----- model.dtype of the trained model: torch.bfloat16
----- Format follows the pre-fine-tuning training to complete, check whether the model has learned to follow our custom format -----
You are given a problem.
Think about the problem and provide your working out.
Place it between < start_working_out > and < end_working_out > .
Then , provide your solution between < SOLUTION > and </ SOLUTION ><| endoftext | > Jenifer has 82 cents in pennies and nickels . Her younger brother mistook all her nickels for dimes and counted the total as $ 1 . 47 . |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad | ><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad | > <| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |>< | vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |><| vision_pad |> ol let 's start by setting up the problem. Let 's denote the number of pennies as \( p \) and the number of nickels as \( n \). We know that the total value of the coins is 82 cents, so we can write the equation:
\[ p + 5n = 82 \]
Next , we need to consider the mistake made by Jenifer 's younger brother. He mistook all the nickels for dimes, so he counted the total as $1 . 47 , which is 147 cents. This gives us another equation:
\[ p + 10n = 147 \]
Now , we have a system of two equations:
1. \(p + 5n = 82 \)
2. \(p + 10n = 147 \)
To solve this system, we can subtract the first equation from the second to eliminate \( p \):
\[ (p + 10n) - (p + 5n) = 147 - 82 \]
\[ 5n = 65 \]
\[ n = 13 \]
Now that we have \( n = 13 \), we can substitute this value back into the first equation to find \( p \):
\[ p + 5 ( 13 ) = 82 \]
\[ p + 65 = 82 \]
\[ p = 17 \]
So , Jenifer has 17 pennies. Let 's verify the solution:
- The value of 17 pennies is \( 17 \times 1 = 17 \) cents.
- The value of 13 nickels is \( 13 \times 5 = 65 \) cents.
- The total value is \( 17 + 65 = 82 \) cents, which matches the given total.
Thus , the number of pennies Jenifer has is \(\boxed{ 17 }\) .< end_working_out >< SOLUTION > 17 </ SOLUTION ><| endoftext |>
As you can see, the model output after pre-fine-tuning meets expectations. The reasoning process is placed between the specified tags <start_working_out> and <end_working_out>, and the final answer is placed between the specified tags. (<start_working_out> will be added to the dialogue question to guide the model to output reasoning. Therefore, the model does not output this start tag, but directly outputs the thinking content, and then ends the thinking with <end_working_out>).
Processing the GRPO dataset
===== step4. Load and process the dataset =============================================================----- Dataset DAPO-Math-17k-Processed:Dataset({ features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'], num_rows: 14116})----- The first prompt: In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.----- Solution of the first item: 34Map: 100%|██████████| 14116/14116 [00:01<00:00, 11664.38 examples/s]----- Content of the 1st conversation format: {'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION> and </SOLUTION>', 'role': 'system'}, {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.', 'role': 'user'}], 'solution': '34', 'data_source': 'math_dapo', 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.\n\nRemember to put your answer on its own line after "Answer:".', 'role': 'user'}], 'ability': 'MATH', 'reward_model': {'ground_truth': '34', 'style': 'rule-lighteval/MATH_v2'}, 'extra_info': {'index': '9a9b6eb4-a1cb-49d1-8c1e-62eaf2f74079'}, 'answer': '34'}Map: 100%|██████████| 14116/14116 [00:04<00:00, 3005.13 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then,provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$. [00:02<00:00, 5114.02 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>Map: 100%|██████████| 14116/14116 [00:02<00:00, 5114.02 examples/s]==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 12,709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)Max Length = 203709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)Max Length = 203
Define and test the reward function
Four reward functions are defined in the training script. The following are scoring examples for two of them:
===== step5. Define and test reward function ==================================================================
match_format:re.compile(' < end_working_out > .*? < SOLUTION > (.+?) </ SOLUTION > [\\s]{0,}(?:<\\|endoftext\\|>)?[\\s]{0,}$', re.MULTILINE|re.DOTALL)
----- Reward function check_answer scoring example:
Case | Question | Response | Answer | Extracted | Score
------------------------------------------------------------------------------------------
1 | Q: 2+2 = ? | Let me think! < end_working_out >< SOLUTION > 4 </ SOLUTION > | 4 | 4 | 5.0
2 | Q: Hello? | < start_working_out > think.. < end_working_out >< SOLUTION > yes </ SOLUTION > | yes | yes | 3.5
3 | Q: Value? | ! < end_working_out >< SOLUTION > 9.5 </ SOLUTION > | 10 | 9.5 | 2.0
4 | Q: Value? | ! < end_working_out >< SOLUTION > 8.3 </ SOLUTION > | 10 | 8.3 | 1.5
5 | Q: Value? | i! < end_working_out >< SOLUTION > 5 </ SOLUTION > | 10 | 5 | -2.5
6 | Q: Answer? | i! < end_working_out >< SOLUTION > no digit </ SOLUTION > | 42 | no digit | -4.5
7 | Q: String? | f! < end_working_out >< SOLUTION > oobar </ SOLUTION > | baz | oobar | -4.5
----- Reward function check_numbers scoring example:
Case | Question | Response(s) | Answer(s) | Score(s)
--------------------------------------------------------------------------------------
1 | Q: 2+2=? | < SOLUTION > 4 | 4 | 3.5
2 | Q: Total amount? | < SOLUTION > 1,234.00 | 1234.0 | 3.5
3 | Q: Output | < SOLUTION > No number | 0 | -2.5
4 | Q: 10-3=? | < SOLUTION > 5 | 7 | -1.5
5 | Q: 1+1=?,2+2=? | < SOLUTION > 2,4 | 2,4 | 3.5,-2.5
Conduct GRPO training
===== step6. Training model========================================================================
. . .
5%|▌ | 5/100 [02:19<40:11, 25.38s/it]************************Question:
. . .
100%|██████████| 100/100 [46:13<00:00, 27.74s/it]can fire at a single
Extracted:
None
{ 'loss' : 0.0067, 'grad_norm' : 0.2612026631832123, 'learning_rate' : 2.777777777777776e-07, 'rewards/match_format_exactly' : 2.25, 'rewards/match_format_approximately' : 0.375, 'rewards/check_answer' : 3.25, 'rewards/check_numbers' : 2.0, 'reward' : 7.875, 'reward_std' : 10.25, 'completion_length' : 1379.25, 'kl' : 0.1681276112794876, 'epoch' : 0.01}
{ 'loss' : 0.0052, 'grad_norm' : 0.22037602961063385, 'learning_rate' : 2.222222222222224e-07, 'rewards/match_format_exactly' : 1.5, 'rewards/match_format_approximately' : -0.75, 'rewards/check_answer' : 1.5, 'rewards/check_numbers' : 0.5, 'reward' : 2.75, 'reward_std' : 11.83568000793457, 'completion_length' : 1665.75, 'kl' : 0.1294582337141037, 'epoch' : 0.01}
{ 'loss' : 0.0043, 'grad_norm' : 0.18991202116012573, 'learning_rate' : 1.66666666666668e-07, 'rewards/match_format_exactly' : 0.75, 'rewards/match_format_approximately' : -1.875, 'rewards/check_answer' : -0.25, 'rewards/check_numbers' : -1.0, 'reward' : -2.375, 'reward_std' : 10.25, 'completion_length' : 1775.5, 'kl' : 0.1086689755320549, 'epoch' : 0.01}
{ 'loss' : 0.0046, 'grad_norm' : 0.025854697450995445, 'learning_rate' : 1.111111111111112e-07, 'rewards/match_format_exactly' : 0.0, 'rewards/match_format_approximately' : -3.0, 'rewards/check_answer' : -2.0, 'rewards/check_numbers' : -2.5, 'reward' : -7.5, 'reward_std' : 0.0, 'completion_length' : 1844.0, 'kl' : 0.11534835398197174, 'epoch' : 0.01}
{ 'loss' : 0.0052, 'grad_norm' : 0.05398529767990112, 'learning_rate' : 5.55555555555556e-08, 'rewards/match_format_exactly' : 0.0, 'rewards/match_format_approximately' : -3.0, 'rewards/check_answer' : -2.0, 'rewards/check_numbers' : -2.5, 'reward' : -7.5, 'reward_std' : 0.0, 'completion_length' : 1844.0, 'kl' : 0.1298024207353592, 'epoch' : 0.01}
{ 'train_runtime' : 2773.8914, 'train_samples_per_second' : 0.144, 'train_steps_per_second' : 0.036, 'train_loss' : 0.005772246685810387, 'epoch' : 0.01}
Here we set max_steps = 100 to complete the test as quickly as possible. During formal training, you can control the number of training rounds by setting epochs and end the training early based on the loss convergence.
Testing the trained model
Processed prompts: 100%|██████████| 1/1 [00:11<00:00, 11.75s/it, est. speed input: 0.85 toks/s, output: 87.15 toks/s]----- Answers from the basic model: - AnswersMath and ArithmeticWhat is the sqrt of 101?Wiki User∙ 2010-05-29 22:38:13Best AnswerCopyThe square root of 101 is 10.0498756 approximatelyWiki User∙ 2010-05-29 22:38:13This answer is:?0?0?0What is the square root of 101?It is approx. root of -101?The square root of -101 can be written as the product of the positive square root of 101 and i (where i is an imaginary number). The square root of 101 is approximately 10.04987751.What is the square root of 101 simplified?sqrt(101) is already simplified since 101 is not a perfect square. Also, we cannot simplify it since 101 is a prime number. (In other words, 101 = 1 x 101, so its only factorization is 1 and 101) In decimal form it is: 10.049875621120891586572919348985505109596599416484785647300807046What is the square root of 20200?sqrt (20200) = sqrt (4 x 100 x 505) = What is the square root of 3 sqrt (4) x sqrt (100) x sqrt (505) = 2 x 10 x sqrt (505) = 20 x sqrt (505) = 10 x sqrt (4) in 101?sqrt(3 in 101) = sqrt(101) x sqrt(3) = sqrt(101 x 3) = sqrt(303) = 17.4069...approx.What is an irrational number?-101 as a number.Why is one sixth of one third the same as one square root of one hundred sixty nine?sqrt(169) = 13 1/6 of 1/3 = (1/6)(1/3) = 1/(6x3) = 1/18 = (1/13)(1/13) = 1/sqrt(169)What number when squared equals six?If you meant sqrt(6)2 then this = 6 and sqrt(6) = 2.4494... For the number to be a square root you need the 6 to be in the denominator or the square root of 6.What is -20 sqrt of 101?-20 square root of (101) - 20 * sqrt(101) - 20 * sqrt(101) is a real number and cannot be simplified any further.What is the square root of 101.8?9.046921979920897...What is the square root of 7561?As sqrt(7561) = sqrt(169)*sqrt(41) = 13*sqrt(41) ~= 274.84779...What is the sqrt of 0.25?The sqrt of 0.25 is 0.5What is the square root of 169 over 10?It is 1.3How do you find the square root of 51?The square root of 51 is approx. 7.1414 The easiest way to do that is to use a calculator. If you do not have a calculator, I strongly suggest using one, since the sqrt of 51 is an irrational number with an infinite amount of decimal places. Square and cube roots can be calculated the old-fashioned manner by using trial and error. 72 = 49 which is too small; 82 = 64 which is too big, etc. If you need to go that route, you need to know your basicProcessed prompts: 100%|██████████| 1/1 [00:26<00:00, 26.65s/it, est. speed input: 2.25 toks/s, output: 74.61 toks/s]
----- GRPO-LoRA model's answer:
Okay, so I need to find the square root of 101. Hmm, let me think. The square root of a number is the value that, when multiplied by itself, gives the original number. But 101 seems like it 's not a perfect square, right? I remember that perfect squares like 100, 121 , 144, etc., are numbers that have exact square roots since they' re squares of integers. 100 , which is 10 squared. So maybe √ 101 is close to 10 but not exactly 10. Let me calculate 10 squared first. 10 × 10 is 100. So √ 101 is a little more than 10. How much more? Well, 10.5 squared is 110.25 , which is higher than 101. So it has to be between 10 and 10.5 . Maybe 10.05 ? Let me try that . 10.05 squared is 10.05 × 10.05 . Let me compute that . 100 + 0.5 + 0.5 + 0.0025 = 101.0025 . That 's very close to 101, so √101 is approximately 10.05. But wait, 10.04 squared might be slightly less. Let me check 10.04: 10.04 × 10.04. This is 10.04 × 10 = 100.4, then 0.04 × 10 = 0.4, and 0.04×0.04 = 0.0016. So total: 100.4 + 0.4 = 100.8, plus 0.0016, which is 100.8016. That's less than 101. So √ 101 is between 10.04 and 10.05 . Maybe 10.045 ? Let me try 10.045 : 10.045 squared. Hmm, 10 × 10 = 100 , 10 × 0.045 So it ' s closer to 10.04. Maybe 10.042? Let me try 10.042: 10.042 squared. 10×10=100, 10×0.042=0.42, 0.042×10=0.42, 0.042×0.042=0.001764. Adding: 100 + 0.42 + 0.42 = 100.84, plus 0.001764 is 100.841764, which is still below 101. So it's closer to 10.043 . Let me try 10.043 : 10.043 squared. 10 × 10 = 100 , 10 × 0.043 = 0.43 , 0.043 × 10 = 0.43 , 0.043 × 0.043 = 0.001849 . Adding: 100 + 0.43 + 0.43 = 100.86 , plus 0.001849 is 100.861849 , still less than 101. So √ 101 is approximately 10.043 . But I wonder if there 's a better way. Maybe using the Newton-Raphson method for square roots? Let me recall that method. Let x be the number we want to find the square root of. Start with an initial guess, say 10. Then the method is x_(new) = (x + N/x)/2, where N is the number, which is 101. So first guess is 10. 20.1/2 = 10.05. Next iteration: (10.05 + 101/10.05)/2. 101 divided by 10.05 is approximately 10.049751. So 10.05 + 10.049751 = 20.099751, divided by 2 is 10.049875. the approximation is approximately 10.049875. That's more accurate than my previous guesses. So the square root of 101 is approximately 10.049875 . But the problem didn 't specify how precise the answer needs to be, so maybe just the decimal approximation is acceptable. So let me express that in decimal form. 10.049875... Hmm, four decimal places would be 10.0499. But to be precise, I should keep more. So maybe 10.049875? But that' sa bit too far. Let me verify with the Newton-Raphson method again. Start with 10. ( 10 + 101/ 10 )/ 2 = 10.05 . Then ( 10.05 + 101 / 10.05 ) / 2. 101 / 10.05 is 10.04975 . So 10.05 + 10.04975 = 20.09975 . Divided by 2 is 10.049875 . Then next iteration: ( 10.049875 + 101 / 10.049875 )/ 2. 101 divided by 10.049875 is approximately 10.049874 . So 10.049875 + 10.049874 = 20.099749 , divided by 2 is 10.0498745 . So after three iterations, it 's approximately 10.0498745. So the square root of 101 is approximately 10.049875. Therefore, I think the answer is around 10.05. But for a better approximation, maybe using more iterations or a calculator. But since the problem doesn' t specify, I 'll go with 10.05. Let me check if 10.05 squared is 101. 10.05 × 10.05. Let me multiply that out. 10×10=100, 10×0.05=0.5, 0.05×10=0.5, 0.05×0.05=0.0025. Adding: 100 + 0.5 + 0.5 = 101, plus 0.0025 is 101.0025. So √101 is slightly less than 10.05. Therefore, the square root is approximately 10.049875. So in decimal form, that's about 10.050 . Therefore, the square root of 101 is approximately 10.05 .
To find the square root of 101 , we can use the Newton-Raphson method for approximation. The method starts with an initial guess and iteratively refines it.
1. **Initial Guess**: Start with \( x_0
The answer here is truncated due to length issues. This is because the test code in this article sets the maximum number of generated tokens, max_tokens = 1024, during inference; these issues have been resolved in the latest training code. See:
"One-click GRPO fine-tuning of the training code for the latest Qwen3 model on a single 4090 card"
Merge and save models
After Lora training is completed, the original model will not be changed, nor will a completely new model be generated. Instead, an additional smaller Lora weight is generated, which contains the trained content. In the above test, this Lora weight is used as a plug-in and loaded together with the original weight for inference testing. After the test is completed, we need to merge the plug-in lora weight with the original model to generate a new complete model.
===== step8. Merging and saving model ============================================================= Unsloth: Merging 4bit and LoRA weights to 16bit...Unsloth: Will use up to 336.32 out of 503.72 RAM for saving.Unsloth: Saving model... This might take 5 minutes... 17%|█▋ | 6/36 [00:00<00:00, 57.02it/s]We will save to Disk and not RAM now.100%|██████████| 36/36 [00:05<00:00, 6.11it/s]Unsloth: Saving tokenizer... Done.Done.
The obtained model can be further quantified as needed.