Commit bd267e6

Parent(s): e64b706

Add GGUF models + tokenizer with LFS
This view is limited to 50 files because it contains too many changes.
- .gitattributes +2 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_code.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_general.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_math.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt +175 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt +175 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt +175 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B F16 | 15.26 GiB | 8.19 B | CUDA | 35 | pp8 | 157.31 ± 1.63 |
+| qwen3vl 8B F16 | 15.26 GiB | 8.19 B | CUDA | 35 | tg128 | 20.52 ± 0.02 |
+
+build: 92bb442ad (7040)
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_code.txt
ADDED
@@ -0,0 +1,173 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21539 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/Qwen3-VL-8B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3vl
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+llama_model_loader: - kv 4: general.basename str = Qwen3-VL
+llama_model_loader: - kv 5: general.size_label str = 8B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
+llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
+llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
+llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
+llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
+llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
+llama_model_loader: - kv 22: general.file_type u32 = 1
+llama_model_loader: - kv 23: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
+llama_model_loader: - kv 24: qwen3vl.n_deepstack_layers u32 = 3
+llama_model_loader: - kv 25: general.quantization_version u32 = 2
+llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 35: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type f16: 254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = F16
+print_info: file size = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3vl
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 4096
+print_info: n_embd_inp = 16384
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 12288
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 40
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: mrope sections = [24, 20, 20, 0]
+print_info: model type = 8B
+print_info: model params = 8.19 B
+print_info: general.name = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors: CUDA0 model buffer size = 3680.32 MiB
+load_tensors: CUDA1 model buffer size = 3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1491.75 MiB
+llama_context: CUDA1 compute buffer size = 98.02 MiB
+llama_context: CUDA_Host compute buffer size = 12.02 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 112.248 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.97 seconds per pass - ETA 2.17 minutes
+[1]2.5123,[2]2.0010,[3]1.5894,[4]1.4775,[5]1.5704,[6]1.6209,[7]1.5889,[8]1.5728,[9]1.5112,[10]1.4743,[11]1.4488,[12]1.4529,[13]1.4287,[14]1.4135,[15]1.4232,[16]1.4071,[17]1.3931,[18]1.3946,[19]1.3839,[20]1.3695,[21]1.3630,[22]1.3613,[23]1.3810,[24]1.3722,[25]1.3768,[26]1.3649,[27]1.3579,[28]1.3558,[29]1.3692,[30]1.3706,[31]1.3622,[32]1.3550,[33]1.3559,[34]1.3544,[35]1.3532,[36]1.3756,[37]1.3849,[38]1.3897,[39]1.3960,[40]1.3964,[41]1.3918,[42]1.4045,[43]1.4042,[44]1.4053,
+Final estimate: PPL = 1.4053 +/- 0.00874
+
+llama_perf_context_print: load time = 4057.43 ms
+llama_perf_context_print: prompt eval time = 116514.69 ms / 90112 tokens ( 1.29 ms per token, 773.40 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 117753.81 ms / 90113 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 16087 + ( 5252 = 3680 + 80 + 1491) + 2767 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19612 + ( 3858 = 3680 + 80 + 98) + 653 |
+llama_memory_breakdown_print: | - Host | 15763 = 15623 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_general.txt
ADDED
@@ -0,0 +1,173 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21539 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/Qwen3-VL-8B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3vl
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+llama_model_loader: - kv 4: general.basename str = Qwen3-VL
+llama_model_loader: - kv 5: general.size_label str = 8B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
+llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
+llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
+llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
+llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
+llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
+llama_model_loader: - kv 22: general.file_type u32 = 1
+llama_model_loader: - kv 23: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
+llama_model_loader: - kv 24: qwen3vl.n_deepstack_layers u32 = 3
+llama_model_loader: - kv 25: general.quantization_version u32 = 2
+llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 35: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type f16: 254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = F16
+print_info: file size = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3vl
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 4096
+print_info: n_embd_inp = 16384
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 12288
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 40
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: mrope sections = [24, 20, 20, 0]
+print_info: model type = 8B
+print_info: model params = 8.19 B
+print_info: general.name = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors: CUDA0 model buffer size = 3680.32 MiB
+load_tensors: CUDA1 model buffer size = 3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1491.75 MiB
+llama_context: CUDA1 compute buffer size = 98.02 MiB
+llama_context: CUDA_Host compute buffer size = 12.02 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.641 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.90 seconds per pass - ETA 0.72 minutes
+[1]6.9501,[2]8.4922,[3]8.9206,[4]8.5067,[5]8.2771,[6]7.0818,[7]6.4316,[8]6.4697,[9]6.7493,[10]6.8592,[11]6.8594,[12]7.1646,[13]7.2302,[14]7.3576,[15]7.4343,
+Final estimate: PPL = 7.4343 +/- 0.15664
+
+llama_perf_context_print: load time = 1922.95 ms
+llama_perf_context_print: prompt eval time = 40363.40 ms / 30720 tokens ( 1.31 ms per token, 761.09 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 40848.59 ms / 30721 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 16087 + ( 5252 = 3680 + 80 + 1491) + 2767 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19612 + ( 3858 = 3680 + 80 + 98) + 653 |
+llama_memory_breakdown_print: | - Host | 15763 = 15623 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_math.txt
ADDED
@@ -0,0 +1,173 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21539 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/Qwen3-VL-8B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3vl
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+llama_model_loader: - kv 4: general.basename str = Qwen3-VL
+llama_model_loader: - kv 5: general.size_label str = 8B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
+llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
+llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
+llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
+llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
+llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
+llama_model_loader: - kv 22: general.file_type u32 = 1
+llama_model_loader: - kv 23: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
+llama_model_loader: - kv 24: qwen3vl.n_deepstack_layers u32 = 3
+llama_model_loader: - kv 25: general.quantization_version u32 = 2
+llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 35: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type f16: 254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = F16
+print_info: file size = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3vl
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 4096
+print_info: n_embd_inp = 16384
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 12288
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 40
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: mrope sections = [24, 20, 20, 0]
+print_info: model type = 8B
+print_info: model params = 8.19 B
+print_info: general.name = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors: CUDA0 model buffer size = 3680.32 MiB
+load_tensors: CUDA1 model buffer size = 3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1491.75 MiB
+llama_context: CUDA1 compute buffer size = 98.02 MiB
+llama_context: CUDA_Host compute buffer size = 12.02 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 46.836 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.90 seconds per pass - ETA 0.77 minutes
+[1]5.0178,[2]5.4503,[3]5.6477,[4]5.7466,[5]5.9143,[6]5.8806,[7]5.8450,[8]5.7878,[9]5.8197,[10]5.8095,[11]5.8265,[12]5.8087,[13]5.8727,[14]5.8747,[15]5.8583,[16]5.8563,
+Final estimate: PPL = 5.8563 +/- 0.10809
+
+llama_perf_context_print: load time = 1984.68 ms
+llama_perf_context_print: prompt eval time = 42289.09 ms / 32768 tokens ( 1.29 ms per token, 774.86 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 42743.27 ms / 32769 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 16087 + ( 5252 = 3680 + 80 + 1491) + 2767 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19612 + ( 3858 = 3680 + 80 + 98) + 653 |
+llama_memory_breakdown_print: | - Host | 15763 = 15623 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE | 9.72 GiB | 8.19 B | CUDA | 35 | pp8 | 174.85 ± 3.34 |
+| qwen3vl 8B MXFP4 MoE | 9.72 GiB | 8.19 B | CUDA | 35 | tg128 | 25.45 ± 0.09 |
+
+build: 92bb442ad (7040)
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt
ADDED
@@ -0,0 +1,174 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3vl
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+llama_model_loader: - kv 4: general.basename str = Qwen3-VL
+llama_model_loader: - kv 5: general.size_label str = 8B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
+llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
+llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
+llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
+llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
+llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
+llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
+llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
+llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 34: general.quantization_version u32 = 2
+llama_model_loader: - kv 35: general.file_type u32 = 38
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type f16: 38 tensors
+llama_model_loader: - type q8_0: 216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = MXFP4 MoE
+print_info: file size = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3vl
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 4096
+print_info: n_embd_inp = 16384
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 12288
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 40
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: mrope sections = [24, 20, 20, 0]
+print_info: model type = 8B
+print_info: model params = 8.19 B
+print_info: general.name = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 5742.53 MiB
+load_tensors: CUDA0 model buffer size = 2105.32 MiB
+load_tensors: CUDA1 model buffer size = 2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1491.75 MiB
+llama_context: CUDA1 compute buffer size = 98.02 MiB
+llama_context: CUDA_Host compute buffer size = 12.02 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 117.59 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.11 seconds per pass - ETA 1.53 minutes
+[1]2.5106,[2]2.0036,[3]1.5907,[4]1.4787,[5]1.5706,[6]1.6214,[7]1.5892,[8]1.5729,[9]1.5112,[10]1.4742,[11]1.4488,[12]1.4530,[13]1.4288,[14]1.4133,[15]1.4233,[16]1.4072,[17]1.3932,[18]1.3947,[19]1.3841,[20]1.3697,[21]1.3631,[22]1.3614,[23]1.3811,[24]1.3723,[25]1.3770,[26]1.3651,[27]1.3581,[28]1.3559,[29]1.3693,[30]1.3706,[31]1.3623,[32]1.3551,[33]1.3560,[34]1.3544,[35]1.3533,[36]1.3756,[37]1.3849,[38]1.3898,[39]1.3960,[40]1.3964,[41]1.3917,[42]1.4045,[43]1.4043,[44]1.4053,
+Final estimate: PPL = 1.4053 +/- 0.00873
+
+llama_perf_context_print: load time = 1533.31 ms
+llama_perf_context_print: prompt eval time = 81963.84 ms / 90112 tokens ( 0.91 ms per token, 1099.41 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 83204.60 ms / 90113 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17698 + (3677 = 2105 + 80 + 1491) + 2731 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21188 + (2283 = 2105 + 80 + 98) + 652 |
+llama_memory_breakdown_print: | - Host | 5882 = 5742 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt
ADDED
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type f16: 38 tensors
|
| 49 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 9.72 GiB (10.19 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 5742.53 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 2105.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 2105.32 MiB
|
| 126 |
+
...............................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 1491.75 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 48.93 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 2.14 seconds per pass - ETA 0.53 minutes
|
| 163 |
+
[1]6.9506,[2]8.4900,[3]8.9135,[4]8.5079,[5]8.2866,[6]7.0919,[7]6.4409,[8]6.4713,[9]6.7514,[10]6.8654,[11]6.8653,[12]7.1736,[13]7.2419,[14]7.3681,[15]7.4452,
|
| 164 |
+
Final estimate: PPL = 7.4452 +/- 0.15691
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1545.44 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 28151.75 ms / 30720 tokens ( 0.92 ms per token, 1091.23 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 28583.01 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17701 + (3677 = 2105 + 80 + 1491) + 2728 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21188 + (2283 = 2105 + 80 + 98) + 652 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 5882 = 5742 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt
ADDED
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type f16: 38 tensors
|
| 49 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 9.72 GiB (10.19 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 5742.53 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 2105.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 2105.32 MiB
|
| 126 |
+
...............................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 1491.75 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 48.909 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 2.13 seconds per pass - ETA 0.57 minutes
|
| 163 |
+
[1]5.0161,[2]5.4576,[3]5.6584,[4]5.7568,[5]5.9256,[6]5.8930,[7]5.8575,[8]5.7991,[9]5.8294,[10]5.8181,[11]5.8352,[12]5.8166,[13]5.8805,[14]5.8819,[15]5.8647,[16]5.8631,
|
| 164 |
+
Final estimate: PPL = 5.8631 +/- 0.10827
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1530.77 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 30397.68 ms / 32768 tokens ( 0.93 ms per token, 1077.98 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 30863.63 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17699 + (3677 = 2105 + 80 + 1491) + 2730 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21188 + (2283 = 2105 + 80 + 98) + 652 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 5882 = 5742 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3vl 8B MXFP4 MoE | 7.24 GiB | 8.19 B | CUDA | 35 | pp8 | 275.90 ± 9.88 |
|
| 9 |
+
| qwen3vl 8B MXFP4 MoE | 7.24 GiB | 8.19 B | CUDA | 35 | tg128 | 46.54 ± 0.18 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt
ADDED
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.24 GiB (7.60 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3668.22 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1875.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1875.32 MiB
|
| 126 |
+
.............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 638.59 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 115.181 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.82 seconds per pass - ETA 1.33 minutes
|
| 163 |
+
[1]2.5461,[2]2.0222,[3]1.6008,[4]1.4877,[5]1.5820,[6]1.6363,[7]1.6017,[8]1.5841,[9]1.5209,[10]1.4820,[11]1.4562,[12]1.4609,[13]1.4362,[14]1.4201,[15]1.4293,[16]1.4126,[17]1.3983,[18]1.3997,[19]1.3886,[20]1.3739,[21]1.3673,[22]1.3653,[23]1.3857,[24]1.3769,[25]1.3814,[26]1.3694,[27]1.3625,[28]1.3608,[29]1.3744,[30]1.3759,[31]1.3674,[32]1.3600,[33]1.3613,[34]1.3596,[35]1.3582,[36]1.3812,[37]1.3905,[38]1.3957,[39]1.4021,[40]1.4025,[41]1.3977,[42]1.4111,[43]1.4109,[44]1.4119,
|
| 164 |
+
Final estimate: PPL = 1.4119 +/- 0.00885
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1313.68 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 70469.89 ms / 90112 tokens ( 0.78 ms per token, 1278.73 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 71650.96 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18884 + (2593 = 1875 + 80 + 638) + 2628 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21430 + (2053 = 1875 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3808 = 3668 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt
ADDED
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.24 GiB (7.60 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3668.22 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1875.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1875.32 MiB
|
| 126 |
+
.............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 638.59 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 47.695 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.86 seconds per pass - ETA 0.45 minutes
|
| 163 |
+
[1]7.1465,[2]8.6407,[3]9.1399,[4]8.7235,[5]8.4601,[6]7.2164,[7]6.5411,[8]6.5728,[9]6.8688,[10]6.9913,[11]6.9949,[12]7.3083,[13]7.3732,[14]7.5144,[15]7.5898,
|
| 164 |
+
Final estimate: PPL = 7.5898 +/- 0.16063
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1351.48 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 24189.03 ms / 30720 tokens ( 0.79 ms per token, 1270.00 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 24604.03 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18881 + (2593 = 1875 + 80 + 638) + 2631 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21430 + (2053 = 1875 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3808 = 3668 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt
ADDED
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21576 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.24 GiB (7.60 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3668.22 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1875.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1875.32 MiB
|
| 126 |
+
.............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 638.59 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 44.442 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.79 seconds per pass - ETA 0.47 minutes
|
| 163 |
+
[1]5.1022,[2]5.5387,[3]5.7284,[4]5.8184,[5]6.0015,[6]5.9571,[7]5.9217,[8]5.8574,[9]5.8853,[10]5.8751,[11]5.8916,[12]5.8752,[13]5.9419,[14]5.9401,[15]5.9276,[16]5.9246,
|
| 164 |
+
Final estimate: PPL = 5.9246 +/- 0.10958
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1398.54 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 25856.18 ms / 32768 tokens ( 0.79 ms per token, 1267.32 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 26290.33 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18882 + (2593 = 1875 + 80 + 638) + 2630 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21430 + (2053 = 1875 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3808 = 3668 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3vl 8B MXFP4 MoE | 7.46 GiB | 8.19 B | CUDA | 35 | pp8 | 268.54 ± 8.59 |
|
| 9 |
+
| qwen3vl 8B MXFP4 MoE | 7.46 GiB | 8.19 B | CUDA | 35 | tg128 | 44.96 ± 0.47 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q5_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.46 GiB (7.82 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3848.59 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1895.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1895.32 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 712.78 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 111.939 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.83 seconds per pass - ETA 1.33 minutes
|
| 163 |
+
[1]2.5224,[2]2.0051,[3]1.5915,[4]1.4799,[5]1.5723,[6]1.6226,[7]1.5896,[8]1.5734,[9]1.5111,[10]1.4738,[11]1.4484,[12]1.4525,[13]1.4284,[14]1.4130,[15]1.4228,[16]1.4070,[17]1.3932,[18]1.3948,[19]1.3837,[20]1.3692,[21]1.3627,[22]1.3609,[23]1.3808,[24]1.3720,[25]1.3765,[26]1.3647,[27]1.3577,[28]1.3556,[29]1.3691,[30]1.3706,[31]1.3622,[32]1.3550,[33]1.3560,[34]1.3544,[35]1.3533,[36]1.3759,[37]1.3852,[38]1.3901,[39]1.3964,[40]1.3968,[41]1.3922,[42]1.4052,[43]1.4050,[44]1.4058,
|
| 164 |
+
Final estimate: PPL = 1.4058 +/- 0.00872
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1353.95 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 72628.10 ms / 90112 tokens ( 0.81 ms per token, 1240.73 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 73829.72 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18787 + (2688 = 1895 + 80 + 712) + 2631 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21410 + (2073 = 1895 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3988 = 3848 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q5_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.46 GiB (7.82 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3848.59 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1895.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1895.32 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 712.78 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 47.074 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.83 seconds per pass - ETA 0.45 minutes
|
| 163 |
+
[1]7.0320,[2]8.5604,[3]9.0161,[4]8.5972,[5]8.3645,[6]7.1528,[7]6.4874,[8]6.5161,[9]6.7994,[10]6.9151,[11]6.9091,[12]7.2185,[13]7.2796,[14]7.4107,[15]7.4899,
|
| 164 |
+
Final estimate: PPL = 7.4899 +/- 0.15780
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1363.37 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 24536.66 ms / 30720 tokens ( 0.80 ms per token, 1252.00 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 24945.45 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18788 + (2688 = 1895 + 80 + 712) + 2630 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21410 + (2073 = 1895 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3988 = 3848 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q5_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.46 GiB (7.82 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3848.59 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1895.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1895.32 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 712.78 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 47.786 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.84 seconds per pass - ETA 0.48 minutes
|
| 163 |
+
[1]5.0420,[2]5.4883,[3]5.6921,[4]5.7939,[5]5.9530,[6]5.9219,[7]5.8762,[8]5.8166,[9]5.8427,[10]5.8306,[11]5.8530,[12]5.8371,[13]5.9020,[14]5.9012,[15]5.8880,[16]5.8853,
|
| 164 |
+
Final estimate: PPL = 5.8853 +/- 0.10873
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1357.45 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 26131.24 ms / 32768 tokens ( 0.80 ms per token, 1253.98 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 26564.25 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18784 + (2688 = 1895 + 80 + 712) + 2633 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21410 + (2073 = 1895 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3988 = 3848 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3vl 8B MXFP4 MoE | 7.69 GiB | 8.19 B | CUDA | 35 | pp8 | 264.49 ± 5.33 |
|
| 9 |
+
| qwen3vl 8B MXFP4 MoE | 7.69 GiB | 8.19 B | CUDA | 35 | tg128 | 41.62 ± 0.17 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q6_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.69 GiB (8.06 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 4040.24 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1916.57 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1916.57 MiB
|
| 126 |
+
..........................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 791.61 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 110.893 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.85 seconds per pass - ETA 1.35 minutes
|
| 163 |
+
[1]2.5108,[2]2.0019,[3]1.5898,[4]1.4778,[5]1.5709,[6]1.6221,[7]1.5899,[8]1.5736,[9]1.5116,[10]1.4747,[11]1.4491,[12]1.4534,[13]1.4292,[14]1.4137,[15]1.4236,[16]1.4074,[17]1.3935,[18]1.3949,[19]1.3841,[20]1.3697,[21]1.3631,[22]1.3615,[23]1.3812,[24]1.3724,[25]1.3771,[26]1.3652,[27]1.3581,[28]1.3561,[29]1.3695,[30]1.3710,[31]1.3626,[32]1.3553,[33]1.3563,[34]1.3547,[35]1.3536,[36]1.3760,[37]1.3853,[38]1.3902,[39]1.3964,[40]1.3968,[41]1.3922,[42]1.4050,[43]1.4048,[44]1.4057,
|
| 164 |
+
Final estimate: PPL = 1.4057 +/- 0.00873
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1371.62 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 73112.97 ms / 90112 tokens ( 0.81 ms per token, 1232.50 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 74384.48 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18688 + (2788 = 1916 + 80 + 791) + 2629 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21388 + (2094 = 1916 + 80 + 98) + 641 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 4180 = 4040 + 128 + 12 |
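Note: the bracketed running values in the perplexity logs above (e.g. [44]1.4057) are cumulative estimates, i.e. exp of the mean per-token negative log-likelihood over every chunk scored so far, and the "Final estimate" line reports that value with its standard error. A minimal illustrative sketch of that accumulation (not llama.cpp's implementation; the numbers below are made up):

import math

def running_ppl(chunk_nlls):
    # chunk_nlls: list of chunks, each a list of per-token negative log-likelihoods.
    total, count, estimates = 0.0, 0, []
    for chunk in chunk_nlls:
        total += sum(chunk)
        count += len(chunk)
        estimates.append(math.exp(total / count))  # cumulative PPL after this chunk
    return estimates

# Made-up example: two chunks of per-token NLLs.
print(running_ppl([[0.92, 0.41, 0.63], [0.35, 0.50]]))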
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,174 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21570 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q6_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.69 GiB (8.06 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 4040.24 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1916.57 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1916.57 MiB
|
| 126 |
+
..........................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 791.61 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 53.802 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.87 seconds per pass - ETA 0.47 minutes
|
| 163 |
+
[1]6.9562,[2]8.4983,[3]8.9296,[4]8.5223,[5]8.2998,[6]7.0994,[7]6.4443,[8]6.4736,[9]6.7542,[10]6.8665,[11]6.8674,[12]7.1743,[13]7.2414,[14]7.3696,[15]7.4477,
|
| 164 |
+
Final estimate: PPL = 7.4477 +/- 0.15673
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1388.47 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 25116.55 ms / 30720 tokens ( 0.82 ms per token, 1223.10 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 25560.37 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18684 + (2788 = 1916 + 80 + 791) + 2633 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21388 + (2094 = 1916 + 80 + 98) + 641 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 4180 = 4040 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt
ADDED
@@ -0,0 +1,174 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q6_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.69 GiB (8.06 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 4040.24 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1916.57 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1916.57 MiB
|
| 126 |
+
..........................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 791.61 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 44.404 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.87 seconds per pass - ETA 0.48 minutes
|
| 163 |
+
[1]5.0090,[2]5.4459,[3]5.6503,[4]5.7510,[5]5.9168,[6]5.8860,[7]5.8513,[8]5.7927,[9]5.8249,[10]5.8130,[11]5.8316,[12]5.8131,[13]5.8761,[14]5.8777,[15]5.8629,[16]5.8615,
|
| 164 |
+
Final estimate: PPL = 5.8615 +/- 0.10817
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1385.16 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 26813.84 ms / 32768 tokens ( 0.82 ms per token, 1222.06 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 27244.56 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18688 + (2788 = 1916 + 80 + 791) + 2629 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21388 + (2094 = 1916 + 80 + 98) + 641 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 4180 = 4040 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 8B MXFP4 MoE | 8.11 GiB | 8.19 B | CUDA | 35 | pp8 | 224.83 ± 11.50 |
| qwen3vl 8B MXFP4 MoE | 8.11 GiB | 8.19 B | CUDA | 35 | tg128 | 35.86 ± 0.22 |

build: 92bb442ad (7040)
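The table above is llama-bench output (pp8 = prompt processing over 8 tokens, tg128 = generation of 128 tokens, t/s = throughput). A small illustrative helper for pulling those throughput figures out of a saved llamabench.txt; the file name and column layout are assumptions taken from the log above, not part of the benchmark tooling:

import re

def parse_llamabench(path="llamabench.txt"):
    # Collect results such as {"pp8": "224.83 ± 11.50", "tg128": "35.86 ± 0.22"}
    # from a llama-bench markdown table like the one above.
    results = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = [c.strip() for c in line.split("|")]
            if len(cols) >= 8 and re.fullmatch(r"(pp|tg)\d+", cols[6]):
                results[cols[6]] = cols[7]
    return results

print(parse_llamabench())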
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt
ADDED
@@ -0,0 +1,173 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 8.11 GiB (8.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 4389.72 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1955.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1955.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 935.34 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 113.764 ms
|
| 160 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.94 seconds per pass - ETA 1.42 minutes
|
| 162 |
+
[1]2.5086,[2]2.0027,[3]1.5903,[4]1.4778,[5]1.5697,[6]1.6206,[7]1.5885,[8]1.5722,[9]1.5105,[10]1.4736,[11]1.4482,[12]1.4524,[13]1.4284,[14]1.4129,[15]1.4230,[16]1.4069,[17]1.3930,[18]1.3945,[19]1.3839,[20]1.3695,[21]1.3630,[22]1.3614,[23]1.3813,[24]1.3725,[25]1.3770,[26]1.3651,[27]1.3580,[28]1.3560,[29]1.3693,[30]1.3709,[31]1.3626,[32]1.3553,[33]1.3563,[34]1.3547,[35]1.3536,[36]1.3760,[37]1.3852,[38]1.3901,[39]1.3964,[40]1.3968,[41]1.3921,[42]1.4049,[43]1.4047,[44]1.4057,
|
| 163 |
+
Final estimate: PPL = 1.4057 +/- 0.00874
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1370.66 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 74980.38 ms / 90112 tokens ( 0.83 ms per token, 1201.81 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 76183.00 ms / 90113 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18506 + (2970 = 1955 + 80 + 935) + 2629 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21350 + (2133 = 1955 + 80 + 98) + 640 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 4529 = 4389 + 128 + 12 |
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt
ADDED
@@ -0,0 +1,173 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 8.11 GiB (8.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 4389.72 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1955.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1955.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 935.34 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 51.033 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.97 seconds per pass - ETA 0.48 minutes
|
| 162 |
+
[1]6.9869,[2]8.5129,[3]8.9244,[4]8.5289,[5]8.3002,[6]7.1014,[7]6.4486,[8]6.4795,[9]6.7587,[10]6.8718,[11]6.8729,[12]7.1788,[13]7.2458,[14]7.3735,[15]7.4515,
|
| 163 |
+
Final estimate: PPL = 7.4515 +/- 0.15701
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1382.45 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 25824.04 ms / 30720 tokens ( 0.84 ms per token, 1189.59 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 26242.73 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18505 + (2970 = 1955 + 80 + 935) + 2631 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21350 + (2133 = 1955 + 80 + 98) + 640 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 4529 = 4389 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 8.11 GiB (8.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 4389.72 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1955.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1955.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 935.34 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 44.391 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.94 seconds per pass - ETA 0.52 minutes
|
| 162 |
+
[1]5.0078,[2]5.4396,[3]5.6464,[4]5.7479,[5]5.9166,[6]5.8832,[7]5.8506,[8]5.7907,[9]5.8240,[10]5.8132,[11]5.8323,[12]5.8145,[13]5.8774,[14]5.8792,[15]5.8645,[16]5.8639,
|
| 163 |
+
Final estimate: PPL = 5.8639 +/- 0.10825
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1420.00 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 28124.87 ms / 32768 tokens ( 0.86 ms per token, 1165.09 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 28614.04 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18504 + (2970 = 1955 + 80 + 935) + 2631 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21350 + (2133 = 1955 + 80 + 98) + 640 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 4529 = 4389 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3vl 8B MXFP4 MoE | 7.21 GiB | 8.19 B | CUDA | 35 | pp8 | 274.38 ± 8.11 |
|
| 9 |
+
| qwen3vl 8B MXFP4 MoE | 7.21 GiB | 8.19 B | CUDA | 35 | tg128 | 45.70 ± 0.61 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,175 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21434 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 50 |
+
llama_model_loader: - type mxfp4: 37 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 7.21 GiB (7.56 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3vl
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 4096
|
| 66 |
+
print_info: n_embd_inp = 16384
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 12288
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = 0
|
| 91 |
+
print_info: rope type = 40
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 98 |
+
print_info: model type = 8B
|
| 99 |
+
print_info: model params = 8.19 B
|
| 100 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 101 |
+
print_info: vocab type = BPE
|
| 102 |
+
print_info: n_vocab = 151936
|
| 103 |
+
print_info: n_merges = 151387
|
| 104 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 105 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 107 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 108 |
+
print_info: LF token = 198 'Ċ'
|
| 109 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 110 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 111 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 112 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 113 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 114 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 115 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 116 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 117 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 118 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 119 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 120 |
+
print_info: max token length = 256
|
| 121 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 122 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 123 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 124 |
+
load_tensors: CPU_Mapped model buffer size = 3641.67 MiB
|
| 125 |
+
load_tensors: CUDA0 model buffer size = 1870.32 MiB
|
| 126 |
+
load_tensors: CUDA1 model buffer size = 1870.32 MiB
|
| 127 |
+
..............................................................................................
|
| 128 |
+
llama_context: constructing llama_context
|
| 129 |
+
llama_context: n_seq_max = 1
|
| 130 |
+
llama_context: n_ctx = 2048
|
| 131 |
+
llama_context: n_ctx_seq = 2048
|
| 132 |
+
llama_context: n_batch = 2048
|
| 133 |
+
llama_context: n_ubatch = 512
|
| 134 |
+
llama_context: causal_attn = 1
|
| 135 |
+
llama_context: flash_attn = auto
|
| 136 |
+
llama_context: kv_unified = false
|
| 137 |
+
llama_context: freq_base = 5000000.0
|
| 138 |
+
llama_context: freq_scale = 1
|
| 139 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 140 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 141 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 144 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 145 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 146 |
+
llama_context: CUDA0 compute buffer size = 620.05 MiB
|
| 147 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 148 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 149 |
+
llama_context: graph nodes = 1267
|
| 150 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 151 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 155 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 156 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 157 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 158 |
+
|
| 159 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 160 |
+
perplexity: tokenizing the input ..
|
| 161 |
+
perplexity: tokenization took 116.021 ms
|
| 162 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 163 |
+
perplexity: 1.91 seconds per pass - ETA 1.40 minutes
|
| 164 |
+
[1]2.6329,[2]2.0660,[3]1.6242,[4]1.5012,[5]1.5952,[6]1.6537,[7]1.6200,[8]1.6041,[9]1.5383,[10]1.5004,[11]1.4724,[12]1.4761,[13]1.4502,[14]1.4327,[15]1.4437,[16]1.4266,[17]1.4120,[18]1.4141,[19]1.4022,[20]1.3872,[21]1.3799,[22]1.3778,[23]1.3985,[24]1.3888,[25]1.3924,[26]1.3798,[27]1.3725,[28]1.3702,[29]1.3834,[30]1.3852,[31]1.3764,[32]1.3686,[33]1.3695,[34]1.3677,[35]1.3663,[36]1.3900,[37]1.4004,[38]1.4062,[39]1.4137,[40]1.4141,[41]1.4092,[42]1.4229,[43]1.4227,[44]1.4241,
|
| 165 |
+
Final estimate: PPL = 1.4241 +/- 0.00887
|
| 166 |
+
|
| 167 |
+
llama_perf_context_print: load time = 1393.50 ms
|
| 168 |
+
llama_perf_context_print: prompt eval time = 70938.65 ms / 90112 tokens ( 0.79 ms per token, 1270.28 tokens per second)
|
| 169 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 170 |
+
llama_perf_context_print: total time = 72159.72 ms / 90113 tokens
|
| 171 |
+
llama_perf_context_print: graphs reused = 0
|
| 172 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18764 + (2570 = 1870 + 80 + 620) + 2772 |
|
| 174 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21434 + (2048 = 1870 + 80 + 98) + 641 |
|
| 175 |
+
llama_memory_breakdown_print: | - Host | 3781 = 3641 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,175 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21576 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 50 |
+
llama_model_loader: - type mxfp4: 37 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 7.21 GiB (7.56 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3vl
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 4096
|
| 66 |
+
print_info: n_embd_inp = 16384
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 12288
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = 0
|
| 91 |
+
print_info: rope type = 40
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 98 |
+
print_info: model type = 8B
|
| 99 |
+
print_info: model params = 8.19 B
|
| 100 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 101 |
+
print_info: vocab type = BPE
|
| 102 |
+
print_info: n_vocab = 151936
|
| 103 |
+
print_info: n_merges = 151387
|
| 104 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 105 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 107 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 108 |
+
print_info: LF token = 198 'Ċ'
|
| 109 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 110 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 111 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 112 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 113 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 114 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 115 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 116 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 117 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 118 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 119 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 120 |
+
print_info: max token length = 256
|
| 121 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 122 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 123 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 124 |
+
load_tensors: CPU_Mapped model buffer size = 3641.67 MiB
|
| 125 |
+
load_tensors: CUDA0 model buffer size = 1870.32 MiB
|
| 126 |
+
load_tensors: CUDA1 model buffer size = 1870.32 MiB
|
| 127 |
+
..............................................................................................
|
| 128 |
+
llama_context: constructing llama_context
|
| 129 |
+
llama_context: n_seq_max = 1
|
| 130 |
+
llama_context: n_ctx = 2048
|
| 131 |
+
llama_context: n_ctx_seq = 2048
|
| 132 |
+
llama_context: n_batch = 2048
|
| 133 |
+
llama_context: n_ubatch = 512
|
| 134 |
+
llama_context: causal_attn = 1
|
| 135 |
+
llama_context: flash_attn = auto
|
| 136 |
+
llama_context: kv_unified = false
|
| 137 |
+
llama_context: freq_base = 5000000.0
|
| 138 |
+
llama_context: freq_scale = 1
|
| 139 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 140 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 141 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 144 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 145 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 146 |
+
llama_context: CUDA0 compute buffer size = 620.05 MiB
|
| 147 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 148 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 149 |
+
llama_context: graph nodes = 1267
|
| 150 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 151 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 155 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 156 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 157 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 158 |
+
|
| 159 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 160 |
+
perplexity: tokenizing the input ..
|
| 161 |
+
perplexity: tokenization took 52.011 ms
|
| 162 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 163 |
+
perplexity: 1.92 seconds per pass - ETA 0.47 minutes
|
| 164 |
+
[1]7.4708,[2]9.1204,[3]9.5980,[4]9.1935,[5]8.8907,[6]7.5468,[7]6.8274,[8]6.8740,[9]7.2106,[10]7.3540,[11]7.3794,[12]7.7009,[13]7.7613,[14]7.9243,[15]7.9963,
|
| 165 |
+
Final estimate: PPL = 7.9963 +/- 0.16811
|
| 166 |
+
|
| 167 |
+
llama_perf_context_print: load time = 1433.39 ms
|
| 168 |
+
llama_perf_context_print: prompt eval time = 24880.66 ms / 30720 tokens ( 0.81 ms per token, 1234.69 tokens per second)
|
| 169 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 170 |
+
llama_perf_context_print: total time = 25323.26 ms / 30721 tokens
|
| 171 |
+
llama_perf_context_print: graphs reused = 0
|
| 172 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18764 + (2570 = 1870 + 80 + 620) + 2772 |
|
| 174 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21434 + (2048 = 1870 + 80 + 98) + 641 |
|
| 175 |
+
llama_memory_breakdown_print: | - Host | 3781 = 3641 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,175 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21434 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 50 |
+
llama_model_loader: - type mxfp4: 37 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 7.21 GiB (7.56 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3vl
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 4096
|
| 66 |
+
print_info: n_embd_inp = 16384
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 12288
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = 0
|
| 91 |
+
print_info: rope type = 40
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 98 |
+
print_info: model type = 8B
|
| 99 |
+
print_info: model params = 8.19 B
|
| 100 |
+
print_info: general.name = Qwen3 VL 8B Instruct Unsloth
|
| 101 |
+
print_info: vocab type = BPE
|
| 102 |
+
print_info: n_vocab = 151936
|
| 103 |
+
print_info: n_merges = 151387
|
| 104 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 105 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 107 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 108 |
+
print_info: LF token = 198 'Ċ'
|
| 109 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 110 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 111 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 112 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 113 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 114 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 115 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 116 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 117 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 118 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 119 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 120 |
+
print_info: max token length = 256
|
| 121 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 122 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 123 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 124 |
+
load_tensors: CPU_Mapped model buffer size = 3641.67 MiB
|
| 125 |
+
load_tensors: CUDA0 model buffer size = 1870.32 MiB
|
| 126 |
+
load_tensors: CUDA1 model buffer size = 1870.32 MiB
|
| 127 |
+
..............................................................................................
|
| 128 |
+
llama_context: constructing llama_context
|
| 129 |
+
llama_context: n_seq_max = 1
|
| 130 |
+
llama_context: n_ctx = 2048
|
| 131 |
+
llama_context: n_ctx_seq = 2048
|
| 132 |
+
llama_context: n_batch = 2048
|
| 133 |
+
llama_context: n_ubatch = 512
|
| 134 |
+
llama_context: causal_attn = 1
|
| 135 |
+
llama_context: flash_attn = auto
|
| 136 |
+
llama_context: kv_unified = false
|
| 137 |
+
llama_context: freq_base = 5000000.0
|
| 138 |
+
llama_context: freq_scale = 1
|
| 139 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 140 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 141 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 144 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 145 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 146 |
+
llama_context: CUDA0 compute buffer size = 620.05 MiB
|
| 147 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 148 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 149 |
+
llama_context: graph nodes = 1267
|
| 150 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 151 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 155 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 156 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 157 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 158 |
+
|
| 159 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 160 |
+
perplexity: tokenizing the input ..
|
| 161 |
+
perplexity: tokenization took 44.25 ms
|
| 162 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 163 |
+
perplexity: 1.83 seconds per pass - ETA 0.48 minutes
|
| 164 |
+
[1]5.3482,[2]5.9293,[3]6.1706,[4]6.2572,[5]6.4530,[6]6.4396,[7]6.4091,[8]6.3640,[9]6.3929,[10]6.3972,[11]6.4098,[12]6.3948,[13]6.4462,[14]6.4485,[15]6.4346,[16]6.4264,
|
| 165 |
+
Final estimate: PPL = 6.4264 +/- 0.11983
|
| 166 |
+
|
| 167 |
+
llama_perf_context_print: load time = 1343.31 ms
|
| 168 |
+
llama_perf_context_print: prompt eval time = 25663.03 ms / 32768 tokens ( 0.78 ms per token, 1276.86 tokens per second)
|
| 169 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 170 |
+
llama_perf_context_print: total time = 26100.77 ms / 32769 tokens
|
| 171 |
+
llama_perf_context_print: graphs reused = 0
|
| 172 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18764 + (2570 = 1870 + 80 + 620) + 2772 |
|
| 174 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21434 + (2048 = 1870 + 80 + 98) + 641 |
|
| 175 |
+
llama_memory_breakdown_print: | - Host | 3781 = 3641 + 128 + 12 |
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|