magiccodingman committed on Nov 16

Commit

bd267e6

1 Parent(s): e64b706

Add GGUF models + tokenizer with LFS

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +2 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_code.txt +173 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_general.txt +173 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_math.txt +173 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +173 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +173 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +173 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt +175 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt +175 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt +175 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt +0 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/llamabench.txt ADDED

@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B F16                 |  15.26 GiB |     8.19 B | CUDA       |  35 |             pp8 |        157.31 ± 1.63 |
+| qwen3vl 8B F16                 |  15.26 GiB |     8.19 B | CUDA       |  35 |           tg128 |         20.52 ± 0.02 |
+build: 92bb442ad (7040)

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_code.txt ADDED

@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21539 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/Qwen3-VL-8B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:                          general.file_type u32              = 1
+llama_model_loader: - kv  23:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  24:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  25:               general.quantization_version u32              = 2
+llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = F16
+print_info: file size   = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors:        CUDA0 model buffer size =  3680.32 MiB
+load_tensors:        CUDA1 model buffer size =  3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 112.248 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.97 seconds per pass - ETA 2.17 minutes
+[1]2.5123,[2]2.0010,[3]1.5894,[4]1.4775,[5]1.5704,[6]1.6209,[7]1.5889,[8]1.5728,[9]1.5112,[10]1.4743,[11]1.4488,[12]1.4529,[13]1.4287,[14]1.4135,[15]1.4232,[16]1.4071,[17]1.3931,[18]1.3946,[19]1.3839,[20]1.3695,[21]1.3630,[22]1.3613,[23]1.3810,[24]1.3722,[25]1.3768,[26]1.3649,[27]1.3579,[28]1.3558,[29]1.3692,[30]1.3706,[31]1.3622,[32]1.3550,[33]1.3559,[34]1.3544,[35]1.3532,[36]1.3756,[37]1.3849,[38]1.3897,[39]1.3960,[40]1.3964,[41]1.3918,[42]1.4045,[43]1.4042,[44]1.4053,
+Final estimate: PPL = 1.4053 +/- 0.00874
+llama_perf_context_print:        load time =    4057.43 ms
+llama_perf_context_print: prompt eval time =  116514.69 ms / 90112 tokens (    1.29 ms per token,   773.40 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =  117753.81 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 16087 + ( 5252 =  3680 +      80 +    1491) +        2767 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 19612 + ( 3858 =  3680 +      80 +      98) +         653 |
+llama_memory_breakdown_print: |   - Host               |                  15763 = 15623 +     128 +      12                |

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_general.txt ADDED

@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21539 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/Qwen3-VL-8B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:                          general.file_type u32              = 1
+llama_model_loader: - kv  23:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  24:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  25:               general.quantization_version u32              = 2
+llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = F16
+print_info: file size   = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors:        CUDA0 model buffer size =  3680.32 MiB
+load_tensors:        CUDA1 model buffer size =  3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.641 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.90 seconds per pass - ETA 0.72 minutes
+[1]6.9501,[2]8.4922,[3]8.9206,[4]8.5067,[5]8.2771,[6]7.0818,[7]6.4316,[8]6.4697,[9]6.7493,[10]6.8592,[11]6.8594,[12]7.1646,[13]7.2302,[14]7.3576,[15]7.4343,
+Final estimate: PPL = 7.4343 +/- 0.15664
+llama_perf_context_print:        load time =    1922.95 ms
+llama_perf_context_print: prompt eval time =   40363.40 ms / 30720 tokens (    1.31 ms per token,   761.09 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   40848.59 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 16087 + ( 5252 =  3680 +      80 +    1491) +        2767 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 19612 + ( 3858 =  3680 +      80 +      98) +         653 |
+llama_memory_breakdown_print: |   - Host               |                  15763 = 15623 +     128 +      12                |

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/perplexity_math.txt ADDED

@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21539 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/Qwen3-VL-8B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:                          general.file_type u32              = 1
+llama_model_loader: - kv  23:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  24:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  25:               general.quantization_version u32              = 2
+llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = F16
+print_info: file size   = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors:        CUDA0 model buffer size =  3680.32 MiB
+load_tensors:        CUDA1 model buffer size =  3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 46.836 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.90 seconds per pass - ETA 0.77 minutes
+[1]5.0178,[2]5.4503,[3]5.6477,[4]5.7466,[5]5.9143,[6]5.8806,[7]5.8450,[8]5.7878,[9]5.8197,[10]5.8095,[11]5.8265,[12]5.8087,[13]5.8727,[14]5.8747,[15]5.8583,[16]5.8563,
+Final estimate: PPL = 5.8563 +/- 0.10809
+llama_perf_context_print:        load time =    1984.68 ms
+llama_perf_context_print: prompt eval time =   42289.09 ms / 32768 tokens (    1.29 ms per token,   774.86 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   42743.27 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 16087 + ( 5252 =  3680 +      80 +    1491) +        2767 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 19612 + ( 3858 =  3680 +      80 +      98) +         653 |
+llama_memory_breakdown_print: |   - Host               |                  15763 = 15623 +     128 +      12                |
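
The KV cache figures in the log above (144.00 MiB each for K and V, 288.00 MiB total) can be checked by hand. The sketch below assumes the cache holds one K and one V vector of width `n_embd_k_gqa` / `n_embd_v_gqa` per cell per layer, stored as f16. That formula is an assumption made for illustration, not something stated in the log, and the helper name `kv_cache_mib` is hypothetical.

```python
# Sketch: reproduce the KV cache sizes reported in the log above.
# All inputs are copied from the log; the formula itself is an assumption
# (one K vector and one V vector per cell per layer, stored as f16).

BYTES_PER_F16 = 2
MIB = 1024 * 1024


def kv_cache_mib(n_cells: int, n_layers: int, n_embd_kv: int,
                 bytes_per_elem: int = BYTES_PER_F16) -> float:
    """Size in MiB of one of the two caches (K or V)."""
    return n_cells * n_layers * n_embd_kv * bytes_per_elem / MIB


k_mib = kv_cache_mib(n_cells=2048, n_layers=36, n_embd_kv=1024)  # n_embd_k_gqa
v_mib = kv_cache_mib(n_cells=2048, n_layers=36, n_embd_kv=1024)  # n_embd_v_gqa
total_mib = k_mib + v_mib
per_layer_mib = total_mib / 36

print(k_mib, v_mib, total_mib, per_layer_mib)
```

Under the same assumption, each layer accounts for 8 MiB. If the 20 offloaded layers are split evenly across the two GPUs, that gives 10 × 8 = 80 MiB per GPU and 16 × 8 = 128 MiB on the CPU, which matches the per-device KV buffer sizes in the log.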

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-F16/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   9.72 GiB |     8.19 B | CUDA       |  35 |             pp8 |        174.85 ± 3.34 |
+| qwen3vl 8B MXFP4 MoE           |   9.72 GiB |     8.19 B | CUDA       |  35 |           tg128 |         25.45 ± 0.09 |
+build: 92bb442ad (7040)
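
To compare these throughput numbers across quantizations, the pipe-delimited table can be read programmatically. The following is a minimal sketch, assuming the table layout shown above and the leading `+` from the diff; `parse_bench_table` is a hypothetical helper, not part of llama.cpp.

```python
from typing import Dict, List


def parse_bench_table(text: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited benchmark table into one dict per data row."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in text.splitlines():
        line = line.lstrip("+").strip()      # drop the diff marker, if any
        if not line.startswith("|"):
            continue                          # skip non-table lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells                    # first table line is the header
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue                          # skip the |---|---:| separator
        rows.append(dict(zip(header, cells)))
    return rows


sample = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   9.72 GiB |     8.19 B | CUDA       |  35 |             pp8 |        174.85 ± 3.34 |
+| qwen3vl 8B MXFP4 MoE           |   9.72 GiB |     8.19 B | CUDA       |  35 |           tg128 |         25.45 ± 0.09 |
"""

rows = parse_bench_table(sample)
for row in rows:
    mean, _, std = row["t/s"].partition("±")  # split "mean ± std"
    print(row["test"], float(mean), float(std))
```

The rows are kept as strings so that the caller decides how to handle units such as `GiB` and the `±` spread.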

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:   38 tensors
+llama_model_loader: - type q8_0:  216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  5742.53 MiB
+load_tensors:        CUDA0 model buffer size =  2105.32 MiB
+load_tensors:        CUDA1 model buffer size =  2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 117.59 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.11 seconds per pass - ETA 1.53 minutes
+[1]2.5106,[2]2.0036,[3]1.5907,[4]1.4787,[5]1.5706,[6]1.6214,[7]1.5892,[8]1.5729,[9]1.5112,[10]1.4742,[11]1.4488,[12]1.4530,[13]1.4288,[14]1.4133,[15]1.4233,[16]1.4072,[17]1.3932,[18]1.3947,[19]1.3841,[20]1.3697,[21]1.3631,[22]1.3614,[23]1.3811,[24]1.3723,[25]1.3770,[26]1.3651,[27]1.3581,[28]1.3559,[29]1.3693,[30]1.3706,[31]1.3623,[32]1.3551,[33]1.3560,[34]1.3544,[35]1.3533,[36]1.3756,[37]1.3849,[38]1.3898,[39]1.3960,[40]1.3964,[41]1.3917,[42]1.4045,[43]1.4043,[44]1.4053,
+Final estimate: PPL = 1.4053 +/- 0.00873
+llama_perf_context_print:        load time =    1533.31 ms
+llama_perf_context_print: prompt eval time =   81963.84 ms / 90112 tokens (    0.91 ms per token,  1099.41 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   83204.60 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 17698 + (3677 =  2105 +      80 +    1491) +        2731 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21188 + (2283 =  2105 +      80 +      98) +         652 |
+llama_memory_breakdown_print: |   - Host               |                  5882 =  5742 +     128 +      12                |
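
Each perplexity log in this commit ends its measurement with a `Final estimate: PPL = ... +/- ...` line. A minimal sketch for extracting that value from a log file follows. It assumes the line format seen in the logs above, and `parse_final_ppl` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "Final estimate: PPL = 1.4053 +/- 0.00873"
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_ppl(log_text: str) -> Optional[Tuple[float, float]]:
    """Return (perplexity, uncertainty) from a log, or None if absent."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


sample = "+Final estimate: PPL = 1.4053 +/- 0.00873\n"
result = parse_final_ppl(sample)
print(result)
```

Returning `None` rather than raising keeps the helper usable on logs from runs that aborted before the final estimate was printed.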

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:   38 tensors
+llama_model_loader: - type q8_0:  216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  5742.53 MiB
+load_tensors:        CUDA0 model buffer size =  2105.32 MiB
+load_tensors:        CUDA1 model buffer size =  2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 48.93 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.14 seconds per pass - ETA 0.53 minutes
+[1]6.9506,[2]8.4900,[3]8.9135,[4]8.5079,[5]8.2866,[6]7.0919,[7]6.4409,[8]6.4713,[9]6.7514,[10]6.8654,[11]6.8653,[12]7.1736,[13]7.2419,[14]7.3681,[15]7.4452,
+Final estimate: PPL = 7.4452 +/- 0.15691
+llama_perf_context_print:        load time =    1545.44 ms
+llama_perf_context_print: prompt eval time =   28151.75 ms / 30720 tokens (    0.92 ms per token,  1091.23 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   28583.01 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 17701 + (3677 =  2105 +      80 +    1491) +        2728 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21188 + (2283 =  2105 +      80 +      98) +         652 |
+llama_memory_breakdown_print: |   - Host               |                  5882 =  5742 +     128 +      12                |
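
The perplexity logs in this commit all end with the same two pieces of information: a running per-chunk estimate (`[1]6.9506,[2]8.4900,...`) and a `Final estimate: PPL = ... +/- ...` line. A minimal sketch of how those values could be pulled out of a log file for comparison across quantizations follows. The function name `parse_perplexity_log` and the regular expressions are illustrative assumptions, not part of llama.cpp; the sample text is copied from the log above.

```python
import re

# Sample lines copied verbatim from the perplexity log above.
SAMPLE_LOG = """\
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+[1]6.9506,[2]8.4900,[3]8.9135,[4]8.5079,[5]8.2866,[6]7.0919,[7]6.4409,[8]6.4713,[9]6.7514,[10]6.8654,[11]6.8653,[12]7.1736,[13]7.2419,[14]7.3681,[15]7.4452,
+Final estimate: PPL = 7.4452 +/- 0.15691
"""

# Hypothetical patterns, written against the log format shown in this commit.
RUNNING_RE = re.compile(r"\[(\d+)\](\d+\.\d+),")
FINAL_RE = re.compile(r"Final estimate: PPL = (\d+\.\d+) \+/- (\d+\.\d+)")


def parse_perplexity_log(text):
    """Return (running, final_ppl, stderr) parsed from perplexity log text.

    running maps the chunk index to the running perplexity after that chunk.
    """
    running = {int(i): float(v) for i, v in RUNNING_RE.findall(text)}
    match = FINAL_RE.search(text)
    if match is None:
        raise ValueError("no 'Final estimate' line found")
    return running, float(match.group(1)), float(match.group(2))


running, final_ppl, stderr = parse_perplexity_log(SAMPLE_LOG)
print(f"chunks={len(running)} final={final_ppl} stderr={stderr}")
```

In these logs the last running value equals the final estimate, which gives a cheap sanity check that a log was not truncated.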

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:   38 tensors
+llama_model_loader: - type q8_0:  216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  5742.53 MiB
+load_tensors:        CUDA0 model buffer size =  2105.32 MiB
+load_tensors:        CUDA1 model buffer size =  2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 48.909 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.13 seconds per pass - ETA 0.57 minutes
+[1]5.0161,[2]5.4576,[3]5.6584,[4]5.7568,[5]5.9256,[6]5.8930,[7]5.8575,[8]5.7991,[9]5.8294,[10]5.8181,[11]5.8352,[12]5.8166,[13]5.8805,[14]5.8819,[15]5.8647,[16]5.8631,
+Final estimate: PPL = 5.8631 +/- 0.10827
+llama_perf_context_print:        load time =    1530.77 ms
+llama_perf_context_print: prompt eval time =   30397.68 ms / 32768 tokens (    0.93 ms per token,  1077.98 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   30863.63 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 17699 + (3677 =  2105 +      80 +    1491) +        2730 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21188 + (2283 =  2105 +      80 +      98) +         652 |
+llama_memory_breakdown_print: |   - Host               |                  5882 =  5742 +     128 +      12                |
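
The KV buffer sizes reported in the log above can be reproduced from the printed hyperparameters. The sketch below assumes the usual sizing rule (cells × layers × per-layer K or V width × bytes per element) and that the 20 offloaded layers are split evenly across the two GPUs; the even split is an inference from the matching 80.00 MiB figures, not something the log states. The helper name `kv_cache_mib` is hypothetical.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_cells, n_layers, n_embd_kv, bytes_per_element=2):
    """Size in MiB of the K (or V) cache across n_layers layers (f16 = 2 bytes)."""
    return n_cells * n_layers * n_embd_kv * bytes_per_element / MIB


# Values copied from the log above.
n_ctx = 2048         # llama_context: n_ctx
n_layer = 36         # print_info: n_layer
n_embd_k_gqa = 1024  # print_info: n_embd_k_gqa
n_embd_v_gqa = 1024  # print_info: n_embd_v_gqa
gpu_layers = 20      # load_tensors: offloading 20 repeating layers to GPU
n_gpus = 2           # ggml_cuda_init: found 2 CUDA devices

k_mib = kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa)
v_mib = kv_cache_mib(n_ctx, n_layer, n_embd_v_gqa)
total_mib = k_mib + v_mib
per_layer_mib = total_mib / n_layer

# Layers left on the CPU keep their KV cache in host memory.
cpu_mib = per_layer_mib * (n_layer - gpu_layers)
# Assumption: offloaded layers are divided evenly between the GPUs.
per_gpu_mib = per_layer_mib * gpu_layers / n_gpus

print(f"K={k_mib} V={v_mib} total={total_mib}")
print(f"CPU={cpu_mib} per-GPU={per_gpu_mib}")
```

Under these assumptions the numbers agree with the `llama_kv_cache` lines: 144 MiB each for K and V, 288 MiB in total, 128 MiB on the CPU and 80 MiB per GPU.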

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |             pp8 |        275.90 ± 9.88 |
+| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |           tg128 |         46.54 ± 0.18 |
+build: 92bb442ad (7040)
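
Each `llamabench.txt` file in this commit holds a small pipe-delimited table with a `t/s` column of the form `mean ± std`. A rough sketch of reading such a table into dictionaries, so that throughput can be compared between quantizations, is shown below. The function name `parse_bench_table` and the added `t/s_mean` / `t/s_std` keys are illustrative assumptions; the sample rows are copied from the file above.

```python
# Table copied from the llamabench.txt diff above (leading "+" diff markers kept).
SAMPLE_TABLE = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |             pp8 |        275.90 ± 9.88 |
+| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |           tg128 |         46.54 ± 0.18 |
+build: 92bb442ad (7040)
"""


def parse_bench_table(text):
    """Parse a pipe-delimited benchmark table into a list of row dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        # Drop surrounding whitespace and the diff "+" prefix, if present.
        line = line.strip().lstrip("+")
        if not line.startswith("|"):
            continue  # not a table line (e.g. the build line)
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # alignment/separator row
        row = dict(zip(header, cells))
        mean, _, std = row["t/s"].partition("±")
        row["t/s_mean"] = float(mean)
        row["t/s_std"] = float(std)
        rows.append(row)
    return rows


rows = parse_bench_table(SAMPLE_TABLE)
for row in rows:
    print(row["test"], row["t/s_mean"], row["t/s_std"])
```

Here `pp8` and `tg128` are the test labels exactly as printed in the table; the sketch treats them as opaque strings.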

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21575 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.24 GiB (7.60 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3668.22 MiB
+load_tensors:        CUDA0 model buffer size =  1875.32 MiB
+load_tensors:        CUDA1 model buffer size =  1875.32 MiB
+.............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   638.59 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 115.181 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.82 seconds per pass - ETA 1.33 minutes
+[1]2.5461,[2]2.0222,[3]1.6008,[4]1.4877,[5]1.5820,[6]1.6363,[7]1.6017,[8]1.5841,[9]1.5209,[10]1.4820,[11]1.4562,[12]1.4609,[13]1.4362,[14]1.4201,[15]1.4293,[16]1.4126,[17]1.3983,[18]1.3997,[19]1.3886,[20]1.3739,[21]1.3673,[22]1.3653,[23]1.3857,[24]1.3769,[25]1.3814,[26]1.3694,[27]1.3625,[28]1.3608,[29]1.3744,[30]1.3759,[31]1.3674,[32]1.3600,[33]1.3613,[34]1.3596,[35]1.3582,[36]1.3812,[37]1.3905,[38]1.3957,[39]1.4021,[40]1.4025,[41]1.3977,[42]1.4111,[43]1.4109,[44]1.4119,
+Final estimate: PPL = 1.4119 +/- 0.00885
+llama_perf_context_print:        load time =    1313.68 ms
+llama_perf_context_print: prompt eval time =   70469.89 ms / 90112 tokens (    0.78 ms per token,  1278.73 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   71650.96 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18884 + (2593 =  1875 +      80 +     638) +        2628 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21430 + (2053 =  1875 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3808 =  3668 +     128 +      12                |

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.24 GiB (7.60 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3668.22 MiB
+load_tensors:        CUDA0 model buffer size =  1875.32 MiB
+load_tensors:        CUDA1 model buffer size =  1875.32 MiB
+.............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   638.59 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.695 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.86 seconds per pass - ETA 0.45 minutes
+[1]7.1465,[2]8.6407,[3]9.1399,[4]8.7235,[5]8.4601,[6]7.2164,[7]6.5411,[8]6.5728,[9]6.8688,[10]6.9913,[11]6.9949,[12]7.3083,[13]7.3732,[14]7.5144,[15]7.5898,
+Final estimate: PPL = 7.5898 +/- 0.16063
+llama_perf_context_print:        load time =    1351.48 ms
+llama_perf_context_print: prompt eval time =   24189.03 ms / 30720 tokens (    0.79 ms per token,  1270.00 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   24604.03 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18881 + (2593 =  1875 +      80 +     638) +        2631 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21430 + (2053 =  1875 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3808 =  3668 +     128 +      12                |
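
As a reading aid for the timing lines in the log above (not part of the commit), the token count and throughput can be re-derived from the chunk count and the prompt eval time. This is a minimal sketch: the helper names `tokens_evaluated` and `tokens_per_second` are hypothetical, and it assumes each chunk is exactly `n_ctx` tokens, which matches the 15 chunks and 30720 tokens reported.

```python
def tokens_evaluated(n_chunks, n_ctx):
    """Total tokens scored, assuming every chunk is exactly n_ctx tokens long."""
    return n_chunks * n_ctx


def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput from a token count and a wall-clock time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values copied from the log: 15 chunks, n_ctx=2048, prompt eval time 24189.03 ms.
n_tokens = tokens_evaluated(15, 2048)
print(n_tokens)  # 30720
print(round(tokens_per_second(n_tokens, 24189.03), 2))
```

The same arithmetic gives 32768 tokens for the 16-chunk math run and 90112 tokens for the 44-chunk code run, matching the counts printed in those logs.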

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21576 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.24 GiB (7.60 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3668.22 MiB
+load_tensors:        CUDA0 model buffer size =  1875.32 MiB
+load_tensors:        CUDA1 model buffer size =  1875.32 MiB
+.............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   638.59 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 44.442 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.79 seconds per pass - ETA 0.47 minutes
+[1]5.1022,[2]5.5387,[3]5.7284,[4]5.8184,[5]6.0015,[6]5.9571,[7]5.9217,[8]5.8574,[9]5.8853,[10]5.8751,[11]5.8916,[12]5.8752,[13]5.9419,[14]5.9401,[15]5.9276,[16]5.9246,
+Final estimate: PPL = 5.9246 +/- 0.10958
+llama_perf_context_print:        load time =    1398.54 ms
+llama_perf_context_print: prompt eval time =   25856.18 ms / 32768 tokens (    0.79 ms per token,  1267.32 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26290.33 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18882 + (2593 =  1875 +      80 +     638) +        2630 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21430 + (2053 =  1875 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3808 =  3668 +     128 +      12                |
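
To compare runs across these log files, the final estimate can be pulled out with a short script. The following is a minimal sketch and not part of the commit: the function name `parse_final_ppl` and the embedded sample text are illustrative assumptions, and only the `Final estimate: PPL = ... +/- ...` line shape is taken from the logs themselves.

```python
import re

# Shape of the summary line as it appears in the perplexity logs in this diff.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_ppl(log_text):
    """Return (ppl, uncertainty) from a perplexity log, or None if no summary line."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Two lines copied from the math log above, including the leading diff '+'.
sample = """\
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+Final estimate: PPL = 5.9246 +/- 0.10958
"""
print(parse_final_ppl(sample))  # (5.9246, 0.10958)
```

Because the pattern is searched rather than anchored to the start of a line, it works both on the raw log files and on text copied from this diff view with the leading `+`.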

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.46 GiB |     8.19 B | CUDA       |  35 |             pp8 |        268.54 ± 8.59 |
+| qwen3vl 8B MXFP4 MoE           |   7.46 GiB |     8.19 B | CUDA       |  35 |           tg128 |         44.96 ± 0.47 |
+build: 92bb442ad (7040)

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q5_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.46 GiB (7.82 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3848.59 MiB
+load_tensors:        CUDA0 model buffer size =  1895.32 MiB
+load_tensors:        CUDA1 model buffer size =  1895.32 MiB
+............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   712.78 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 111.939 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.83 seconds per pass - ETA 1.33 minutes
+[1]2.5224,[2]2.0051,[3]1.5915,[4]1.4799,[5]1.5723,[6]1.6226,[7]1.5896,[8]1.5734,[9]1.5111,[10]1.4738,[11]1.4484,[12]1.4525,[13]1.4284,[14]1.4130,[15]1.4228,[16]1.4070,[17]1.3932,[18]1.3948,[19]1.3837,[20]1.3692,[21]1.3627,[22]1.3609,[23]1.3808,[24]1.3720,[25]1.3765,[26]1.3647,[27]1.3577,[28]1.3556,[29]1.3691,[30]1.3706,[31]1.3622,[32]1.3550,[33]1.3560,[34]1.3544,[35]1.3533,[36]1.3759,[37]1.3852,[38]1.3901,[39]1.3964,[40]1.3968,[41]1.3922,[42]1.4052,[43]1.4050,[44]1.4058,
+Final estimate: PPL = 1.4058 +/- 0.00872
+llama_perf_context_print:        load time =    1353.95 ms
+llama_perf_context_print: prompt eval time =   72628.10 ms / 90112 tokens (    0.81 ms per token,  1240.73 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   73829.72 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18787 + (2688 =  1895 +      80 +     712) +        2631 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21410 + (2073 =  1895 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3988 =  3848 +     128 +      12                |
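
The KV cache sizes in the log above follow from the printed hyperparameters. The sketch below is a reading aid, not part of the commit. It assumes the cache holds one K and one V vector of width `n_embd_k_gqa` / `n_embd_v_gqa` (1024 each) per cell per layer, stored as f16 at 2 bytes per element; the function name `kv_side_mib` is hypothetical.

```python
MIB = 1024 * 1024


def kv_side_mib(n_cells, n_layers, n_embd_gqa, bytes_per_elem=2):
    """Size in MiB of one side (K or V) of the cache under the stated assumptions."""
    return n_cells * n_layers * n_embd_gqa * bytes_per_elem / MIB


# Values from the log: 2048 cells, 36 layers, n_embd_k_gqa = n_embd_v_gqa = 1024, f16.
k_mib = kv_side_mib(2048, 36, 1024)
v_mib = kv_side_mib(2048, 36, 1024)
print(k_mib, v_mib, k_mib + v_mib)  # 144.0 144.0 288.0

# K plus V for a single layer, used to check the per-device split.
per_layer_mib = 2 * kv_side_mib(2048, 1, 1024)
print(per_layer_mib)       # 8.0
print(16 * per_layer_mib)  # 128.0
print(10 * per_layer_mib)  # 80.0
```

The last two figures match the reported CPU (128.00 MiB) and per-GPU (80.00 MiB) KV buffers, which is consistent with the 20 offloaded layers being split evenly across the two GPUs and the remaining 16 staying on the host. That split is an inference from the sizes, not something the log states directly.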

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q5_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.46 GiB (7.82 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3848.59 MiB
+load_tensors:        CUDA0 model buffer size =  1895.32 MiB
+load_tensors:        CUDA1 model buffer size =  1895.32 MiB
+............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   712.78 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.074 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.83 seconds per pass - ETA 0.45 minutes
+[1]7.0320,[2]8.5604,[3]9.0161,[4]8.5972,[5]8.3645,[6]7.1528,[7]6.4874,[8]6.5161,[9]6.7994,[10]6.9151,[11]6.9091,[12]7.2185,[13]7.2796,[14]7.4107,[15]7.4899,
+Final estimate: PPL = 7.4899 +/- 0.15780
+llama_perf_context_print:        load time =    1363.37 ms
+llama_perf_context_print: prompt eval time =   24536.66 ms / 30720 tokens (    0.80 ms per token,  1252.00 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   24945.45 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18788 + (2688 =  1895 +      80 +     712) +        2630 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21410 + (2073 =  1895 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3988 =  3848 +     128 +      12                |
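To compare quantization variants across these log files, the final perplexity estimate can be pulled out with a small script. This is a sketch only: `parse_final_ppl` is a hypothetical helper, and the regular expression assumes the `Final estimate: PPL = X +/- Y` line format shown in the log above.

```python
import re

# Matches e.g. "Final estimate: PPL = 7.4899 +/- 0.15780".
# re.search is used, so a leading "+" diff marker on the line is tolerated.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def parse_final_ppl(log_text):
    """Return (ppl, uncertainty) from a perplexity log, or None if not found."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Sample copied from the perplexity_general.txt log above.
sample = "+[14]7.4107,[15]7.4899,\n+Final estimate: PPL = 7.4899 +/- 0.15780\n"
result = parse_final_ppl(sample)
print(result)
```

Returning `None` for logs without a final estimate keeps truncated or failed runs easy to detect when looping over many files.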

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q5_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.46 GiB (7.82 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3848.59 MiB
+load_tensors:        CUDA0 model buffer size =  1895.32 MiB
+load_tensors:        CUDA1 model buffer size =  1895.32 MiB
+............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   712.78 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.786 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.84 seconds per pass - ETA 0.48 minutes
+[1]5.0420,[2]5.4883,[3]5.6921,[4]5.7939,[5]5.9530,[6]5.9219,[7]5.8762,[8]5.8166,[9]5.8427,[10]5.8306,[11]5.8530,[12]5.8371,[13]5.9020,[14]5.9012,[15]5.8880,[16]5.8853,
+Final estimate: PPL = 5.8853 +/- 0.10873
+llama_perf_context_print:        load time =    1357.45 ms
+llama_perf_context_print: prompt eval time =   26131.24 ms / 32768 tokens (    0.80 ms per token,  1253.98 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26564.25 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18784 + (2688 =  1895 +      80 +     712) +        2633 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21410 + (2073 =  1895 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3988 =  3848 +     128 +      12                |
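The KV cache sizes reported by `llama_kv_cache` in the log above follow from the model dimensions. The sketch below reproduces the arithmetic under the assumption that each layer stores `n_cells * n_embd_k_gqa` f16 values for K and the same for V; `kv_buffer_mib` is a hypothetical helper, not a llama.cpp function. The 10/10/16 layer split across CUDA0, CUDA1, and CPU is an inference from "offloaded 20/37 layers" and the two GPUs, not something the log states directly.

```python
def kv_buffer_mib(n_cells, n_layer, n_embd_gqa, bytes_per_elem=2):
    """Size in MiB of one K (or V) buffer: cells * layers * width * element size."""
    return n_cells * n_layer * n_embd_gqa * bytes_per_elem / (1024 ** 2)

# From the log: 2048 cells, 36 layers, n_embd_k_gqa = n_embd_v_gqa = 1024, f16.
k_mib = kv_buffer_mib(2048, 36, 1024)
v_mib = kv_buffer_mib(2048, 36, 1024)
total_mib = k_mib + v_mib

# Assumed split: 10 layers per GPU, 16 layers on the CPU (K and V together).
gpu_mib = 2 * kv_buffer_mib(2048, 10, 1024)
cpu_mib = 2 * kv_buffer_mib(2048, 16, 1024)
print(k_mib, v_mib, total_mib, gpu_mib, cpu_mib)
```

Under these assumptions the results line up with the logged 144.00 MiB for K, 144.00 MiB for V, 288.00 MiB total, 80.00 MiB per CUDA device, and 128.00 MiB on the CPU.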

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |             pp8 |        264.49 ± 5.33 |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |           tg128 |         41.62 ± 0.17 |
+build: 92bb442ad (7040)
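The benchmark table above is plain Markdown, so its throughput figures can be extracted without any special tooling. The following is a minimal sketch; `parse_bench_rows` is a hypothetical helper, and it assumes the column layout shown above, with the test name in the second-to-last column and `mean ± std` in the last.

```python
def parse_bench_rows(text):
    """Extract test name, mean t/s, and std from a llama-bench Markdown table."""
    rows = []
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # drop the diff "+" marker if present
        # Skip non-table lines and the "| --- | ---: |" separator row.
        if not line.startswith("|") or set(line) <= set("|-: "):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0] == "model":  # header row
            continue
        mean, _, std = cells[-1].partition("±")
        rows.append({"test": cells[-2], "tps": float(mean), "std": float(std)})
    return rows

# Sample rows copied from the llamabench.txt output above.
sample = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |             pp8 |        264.49 ± 5.33 |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |           tg128 |         41.62 ± 0.17 |
"""
rows = parse_bench_rows(sample)
print(rows)
```

Keeping the mean and standard deviation as separate floats makes it straightforward to tabulate prompt processing (`pp8`) and token generation (`tg128`) speeds across the quantization variants in this commit.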

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q6_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.69 GiB (8.06 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4040.24 MiB
+load_tensors:        CUDA0 model buffer size =  1916.57 MiB
+load_tensors:        CUDA1 model buffer size =  1916.57 MiB
+..........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   791.61 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 110.893 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.85 seconds per pass - ETA 1.35 minutes
+[1]2.5108,[2]2.0019,[3]1.5898,[4]1.4778,[5]1.5709,[6]1.6221,[7]1.5899,[8]1.5736,[9]1.5116,[10]1.4747,[11]1.4491,[12]1.4534,[13]1.4292,[14]1.4137,[15]1.4236,[16]1.4074,[17]1.3935,[18]1.3949,[19]1.3841,[20]1.3697,[21]1.3631,[22]1.3615,[23]1.3812,[24]1.3724,[25]1.3771,[26]1.3652,[27]1.3581,[28]1.3561,[29]1.3695,[30]1.3710,[31]1.3626,[32]1.3553,[33]1.3563,[34]1.3547,[35]1.3536,[36]1.3760,[37]1.3853,[38]1.3902,[39]1.3964,[40]1.3968,[41]1.3922,[42]1.4050,[43]1.4048,[44]1.4057,
+Final estimate: PPL = 1.4057 +/- 0.00873
+llama_perf_context_print:        load time =    1371.62 ms
+llama_perf_context_print: prompt eval time =   73112.97 ms / 90112 tokens (    0.81 ms per token,  1232.50 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   74384.48 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18688 + (2788 =  1916 +      80 +     791) +        2629 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21388 + (2094 =  1916 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  4180 =  4040 +     128 +      12                |
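
The timing lines in the log above can be cross-checked with a little arithmetic: the token count should equal the chunk count times the context size, and the throughput follows from the prompt eval time. The sketch below is only a sanity check; the variable names are illustrative, and the numbers are copied by hand from the log rather than read from any llama.cpp API.

```python
# Values copied by hand from the perplexity_code.txt log above.
n_chunks = 44               # "calculating perplexity over 44 chunks"
n_ctx = 2048                # "n_ctx=2048"
prompt_eval_ms = 73112.97   # "prompt eval time = 73112.97 ms"

# Each chunk is one full context window, so the total is a simple product.
tokens = n_chunks * n_ctx

# Throughput and per-token latency, in the same units the log prints.
tokens_per_second = tokens / (prompt_eval_ms / 1000.0)
ms_per_token = prompt_eval_ms / tokens

print(tokens)
print(round(tokens_per_second, 2))
print(round(ms_per_token, 2))
```

The results agree with the logged `90112 tokens`, `1232.50 tokens per second`, and `0.81 ms per token`.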

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21570 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q6_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.69 GiB (8.06 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4040.24 MiB
+load_tensors:        CUDA0 model buffer size =  1916.57 MiB
+load_tensors:        CUDA1 model buffer size =  1916.57 MiB
+..........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   791.61 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 53.802 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.87 seconds per pass - ETA 0.47 minutes
+[1]6.9562,[2]8.4983,[3]8.9296,[4]8.5223,[5]8.2998,[6]7.0994,[7]6.4443,[8]6.4736,[9]6.7542,[10]6.8665,[11]6.8674,[12]7.1743,[13]7.2414,[14]7.3696,[15]7.4477,
+Final estimate: PPL = 7.4477 +/- 0.15673
+llama_perf_context_print:        load time =    1388.47 ms
+llama_perf_context_print: prompt eval time =   25116.55 ms / 30720 tokens (    0.82 ms per token,  1223.10 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   25560.37 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18684 + (2788 =  1916 +      80 +     791) +        2633 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21388 + (2094 =  1916 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  4180 =  4040 +     128 +      12                |
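
To compare quantizations across these log files, the final perplexity and its reported uncertainty need to be pulled out of each one. The following is a minimal sketch that assumes the `Final estimate: PPL = X +/- Y` line format and the `[n]value,` running-estimate format seen in the log above; `parse_final_ppl` and `parse_running` are hypothetical helper names, not part of llama.cpp.

```python
import re

# Patterns based on the line formats visible in the log above.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")
RUNNING_RE = re.compile(r"\[(\d+)\]([0-9.]+),")


def parse_final_ppl(log_text):
    """Return (ppl, uncertainty) from a perplexity log, or None if absent."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def parse_running(log_text):
    """Return the running per-chunk estimates as (chunk_index, ppl) pairs."""
    return [(int(i), float(v)) for i, v in RUNNING_RE.findall(log_text)]


# Shortened sample copied from the perplexity_general.txt log above.
sample = (
    "+[13]7.2414,[14]7.3696,[15]7.4477,\n"
    "+Final estimate: PPL = 7.4477 +/- 0.15673\n"
)

final = parse_final_ppl(sample)
running = parse_running(sample)
print(final)
print(running[-1])
```

In this log the last running value and the final estimate are the same number, which gives a cheap consistency check when parsing many files.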

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q6_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.69 GiB (8.06 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4040.24 MiB
+load_tensors:        CUDA0 model buffer size =  1916.57 MiB
+load_tensors:        CUDA1 model buffer size =  1916.57 MiB
+..........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   791.61 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 44.404 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.87 seconds per pass - ETA 0.48 minutes
+[1]5.0090,[2]5.4459,[3]5.6503,[4]5.7510,[5]5.9168,[6]5.8860,[7]5.8513,[8]5.7927,[9]5.8249,[10]5.8130,[11]5.8316,[12]5.8131,[13]5.8761,[14]5.8777,[15]5.8629,[16]5.8615,
+Final estimate: PPL = 5.8615 +/- 0.10817
+llama_perf_context_print:        load time =    1385.16 ms
+llama_perf_context_print: prompt eval time =   26813.84 ms / 32768 tokens (    0.82 ms per token,  1222.06 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   27244.56 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18688 + (2788 =  1916 +      80 +     791) +        2629 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21388 + (2094 =  1916 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  4180 =  4040 +     128 +      12                |
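
The KV cache sizes reported in the log above follow from the printed hyperparameters. The sketch below reproduces them under two stated assumptions: f16 storage takes 2 bytes per element (the log labels both K and V as f16), and the 20 offloaded layers are split evenly across the two GPUs. All values are copied by hand from the log; the variable names are illustrative.

```python
# Hyperparameters copied from the log above.
n_ctx = 2048            # "2048 cells"
n_layer = 36            # "36 layers"
n_embd_k_gqa = 1024     # K width per layer
n_embd_v_gqa = 1024     # V width per layer
bytes_per_elem = 2      # assumption: f16 = 2 bytes
gpu_layers = 20         # "offloaded 20/37 layers to GPU"
n_gpus = 2              # assumption: even split across both devices

MIB = 1024 * 1024

# Total K and V cache sizes across all layers.
k_mib = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem / MIB
v_mib = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem / MIB
total_mib = k_mib + v_mib

# Per-layer cost, then the CPU/GPU split implied by the offload count.
per_layer_mib = total_mib / n_layer
cpu_mib = (n_layer - gpu_layers) * per_layer_mib
gpu_mib_each = gpu_layers * per_layer_mib / n_gpus

print(k_mib, v_mib, total_mib)
print(cpu_mib, gpu_mib_each)
```

These figures match the logged 144.00 MiB for K, 144.00 MiB for V, 288.00 MiB total, 128.00 MiB on CPU, and 80.00 MiB on each CUDA device.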

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   8.11 GiB |     8.19 B | CUDA       |  35 |             pp8 |       224.83 ± 11.50 |
+| qwen3vl 8B MXFP4 MoE           |   8.11 GiB |     8.19 B | CUDA       |  35 |           tg128 |         35.86 ± 0.22 |
+build: 92bb442ad (7040)
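
The benchmark table above is plain pipe-delimited text, so its throughput figures can be extracted without any special tooling. This is a rough sketch that assumes the column layout shown above (test name in the second-to-last column, `mean ± spread` in the last) and the leading `+` added by the diff view; `parse_bench_rows` is a hypothetical helper, not part of llama.cpp.

```python
def parse_bench_rows(text):
    """Extract test name, mean t/s, and spread from a llama-bench style table."""
    rows = []
    for line in text.splitlines():
        # Drop the diff marker and surrounding whitespace.
        line = line.lstrip("+").strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header row and the dashed separator row.
        if cells[0] == "model" or set(cells[0]) <= {"-", ":", " "}:
            continue
        mean, _, spread = cells[-1].partition("±")
        rows.append(
            {"test": cells[-2], "tps": float(mean), "spread": float(spread)}
        )
    return rows


# Table copied from the llamabench.txt diff above.
sample = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   8.11 GiB |     8.19 B | CUDA       |  35 |             pp8 |       224.83 ± 11.50 |
+| qwen3vl 8B MXFP4 MoE           |   8.11 GiB |     8.19 B | CUDA       |  35 |           tg128 |         35.86 ± 0.22 |
+build: 92bb442ad (7040)
"""

rows = parse_bench_rows(sample)
for row in rows:
    print(row["test"], row["tps"], row["spread"])
```

Running the same helper over each quantization's `llamabench.txt` would give directly comparable `pp8` and `tg128` numbers.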

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 8.11 GiB (8.50 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4389.72 MiB
+load_tensors:        CUDA0 model buffer size =  1955.32 MiB
+load_tensors:        CUDA1 model buffer size =  1955.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   935.34 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 113.764 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.94 seconds per pass - ETA 1.42 minutes
+[1]2.5086,[2]2.0027,[3]1.5903,[4]1.4778,[5]1.5697,[6]1.6206,[7]1.5885,[8]1.5722,[9]1.5105,[10]1.4736,[11]1.4482,[12]1.4524,[13]1.4284,[14]1.4129,[15]1.4230,[16]1.4069,[17]1.3930,[18]1.3945,[19]1.3839,[20]1.3695,[21]1.3630,[22]1.3614,[23]1.3813,[24]1.3725,[25]1.3770,[26]1.3651,[27]1.3580,[28]1.3560,[29]1.3693,[30]1.3709,[31]1.3626,[32]1.3553,[33]1.3563,[34]1.3547,[35]1.3536,[36]1.3760,[37]1.3852,[38]1.3901,[39]1.3964,[40]1.3968,[41]1.3921,[42]1.4049,[43]1.4047,[44]1.4057,
+Final estimate: PPL = 1.4057 +/- 0.00874
+llama_perf_context_print:        load time =    1370.66 ms
+llama_perf_context_print: prompt eval time =   74980.38 ms / 90112 tokens (    0.83 ms per token,  1201.81 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   76183.00 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18506 + (2970 =  1955 +      80 +     935) +        2629 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21350 + (2133 =  1955 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  4529 =  4389 +     128 +      12                |

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21573 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 8.11 GiB (8.50 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4389.72 MiB
+load_tensors:        CUDA0 model buffer size =  1955.32 MiB
+load_tensors:        CUDA1 model buffer size =  1955.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   935.34 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 51.033 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.97 seconds per pass - ETA 0.48 minutes
+[1]6.9869,[2]8.5129,[3]8.9244,[4]8.5289,[5]8.3002,[6]7.1014,[7]6.4486,[8]6.4795,[9]6.7587,[10]6.8718,[11]6.8729,[12]7.1788,[13]7.2458,[14]7.3735,[15]7.4515,
+Final estimate: PPL = 7.4515 +/- 0.15701
+llama_perf_context_print:        load time =    1382.45 ms
+llama_perf_context_print: prompt eval time =   25824.04 ms / 30720 tokens (    0.84 ms per token,  1189.59 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26242.73 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18505 + (2970 =  1955 +      80 +     935) +        2631 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21350 + (2133 =  1955 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  4529 =  4389 +     128 +      12                |
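Reviewer note: the KV cache and throughput figures in the log above can be cross-checked by hand. The sketch below is only a sanity check under stated assumptions, not part of llama.cpp: every constant is copied from the log lines above, the helper name `kv_mib` is made up for illustration, and the 16/10/10 layer split across CPU, CUDA0, and CUDA1 is inferred from "offloaded 20/37 layers" and the matching 80 MiB buffers rather than stated in the log.

```python
# Values copied from the perplexity_general.txt log above.
N_CTX = 2048           # llama_context: n_ctx
N_LAYER = 36           # print_info: n_layer
N_EMBD_K_GQA = 1024    # print_info: n_embd_k_gqa
N_EMBD_V_GQA = 1024    # print_info: n_embd_v_gqa
F16_BYTES = 2          # K and V are reported as f16

MIB = 1024 * 1024


def kv_mib(cells: int, layers: int, width: int, elem_bytes: int) -> float:
    """Size in MiB of one cache tensor set (K or V); hypothetical helper."""
    return cells * layers * width * elem_bytes / MIB


k_mib = kv_mib(N_CTX, N_LAYER, N_EMBD_K_GQA, F16_BYTES)
v_mib = kv_mib(N_CTX, N_LAYER, N_EMBD_V_GQA, F16_BYTES)
total_mib = k_mib + v_mib

# K + V for a single layer, used to check the per-device buffers.
per_layer_mib = total_mib / N_LAYER

# Assumption: 20 offloaded layers split evenly over two GPUs, 16 left on CPU.
cpu_mib = 16 * per_layer_mib
gpu_mib = 10 * per_layer_mib

# Throughput: 15 chunks of n_ctx tokens over the reported prompt eval time.
chunks = 15
prompt_eval_ms = 25824.04
tokens = chunks * N_CTX
tokens_per_second = tokens / (prompt_eval_ms / 1000.0)

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
print(f"CPU = {cpu_mib:.2f} MiB, per GPU = {gpu_mib:.2f} MiB")
print(f"{tokens} tokens at {tokens_per_second:.2f} tokens per second")
```

The same arithmetic applies to the code and math runs, which differ only in chunk count (44 and 16) and prompt eval time.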

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21574 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 8.11 GiB (8.50 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4389.72 MiB
+load_tensors:        CUDA0 model buffer size =  1955.32 MiB
+load_tensors:        CUDA1 model buffer size =  1955.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   935.34 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 44.391 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.94 seconds per pass - ETA 0.52 minutes
+[1]5.0078,[2]5.4396,[3]5.6464,[4]5.7479,[5]5.9166,[6]5.8832,[7]5.8506,[8]5.7907,[9]5.8240,[10]5.8132,[11]5.8323,[12]5.8145,[13]5.8774,[14]5.8792,[15]5.8645,[16]5.8639,
+Final estimate: PPL = 5.8639 +/- 0.10825
+llama_perf_context_print:        load time =    1420.00 ms
+llama_perf_context_print: prompt eval time =   28124.87 ms / 32768 tokens (    0.86 ms per token,  1165.09 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   28614.04 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18504 + (2970 =  1955 +      80 +     935) +        2631 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21350 + (2133 =  1955 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  4529 =  4389 +     128 +      12                |

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |             pp8 |        274.38 ± 8.11 |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |           tg128 |         45.70 ± 0.61 |
+build: 92bb442ad (7040)
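
To compare the llama-bench rows above across quantization variants, the table can be read into records. This is a minimal sketch, not part of the commit: the function name `parse_bench_rows` and the dictionary keys are hypothetical, and it assumes the seven-column layout shown in this file (model, size, params, backend, ngl, test, t/s) with an optional leading `+` from the diff.

```python
# Sample rows copied from the llamabench.txt diff above.
SAMPLE = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |             pp8 |        274.38 ± 8.11 |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |           tg128 |         45.70 ± 0.61 |
"""


def parse_bench_rows(log_text):
    """Return one dict per data row of a llama-bench table (hypothetical helper)."""
    rows = []
    for line in log_text.splitlines():
        line = line.lstrip("+").strip()      # drop the diff marker, if present
        if not line.startswith("|"):
            continue                          # not a table line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 7 or "±" not in cells[6]:
            continue                          # header or separator row
        mean, err = (float(x) for x in cells[6].split("±"))
        rows.append({
            "model": cells[0],
            "size": cells[1],
            "params": cells[2],
            "backend": cells[3],
            "ngl": int(cells[4]),
            "test": cells[5],
            "tps": mean,
            "tps_err": err,
        })
    return rows


rows = parse_bench_rows(SAMPLE)
for r in rows:
    print(r["test"], r["tps"], r["tps_err"])
```

Here `pp8` is the prompt-processing test and `tg128` the token-generation test, so the two rows are not directly comparable to each other, only to the same test in another variant's file.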

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt ADDED

	@@ -0,0 +1,175 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21434 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:    1 tensors
+llama_model_loader: - type mxfp4:   37 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.21 GiB (7.56 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3641.67 MiB
+load_tensors:        CUDA0 model buffer size =  1870.32 MiB
+load_tensors:        CUDA1 model buffer size =  1870.32 MiB
+..............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   620.05 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 116.021 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.91 seconds per pass - ETA 1.40 minutes
+[1]2.6329,[2]2.0660,[3]1.6242,[4]1.5012,[5]1.5952,[6]1.6537,[7]1.6200,[8]1.6041,[9]1.5383,[10]1.5004,[11]1.4724,[12]1.4761,[13]1.4502,[14]1.4327,[15]1.4437,[16]1.4266,[17]1.4120,[18]1.4141,[19]1.4022,[20]1.3872,[21]1.3799,[22]1.3778,[23]1.3985,[24]1.3888,[25]1.3924,[26]1.3798,[27]1.3725,[28]1.3702,[29]1.3834,[30]1.3852,[31]1.3764,[32]1.3686,[33]1.3695,[34]1.3677,[35]1.3663,[36]1.3900,[37]1.4004,[38]1.4062,[39]1.4137,[40]1.4141,[41]1.4092,[42]1.4229,[43]1.4227,[44]1.4241,
+Final estimate: PPL = 1.4241 +/- 0.00887
+llama_perf_context_print:        load time =    1393.50 ms
+llama_perf_context_print: prompt eval time =   70938.65 ms / 90112 tokens (    0.79 ms per token,  1270.28 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   72159.72 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18764 + (2570 =  1870 +      80 +     620) +        2772 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21434 + (2048 =  1870 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  3781 =  3641 +     128 +      12                |
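
The figures in the perplexity log above can be cross-checked against each other: the token count should equal chunks times `n_ctx`, and the reported throughput should follow from the prompt eval time. A minimal sketch, assuming the log line formats shown in this file; `summarize_perplexity` and its return keys are hypothetical names, not part of llama.cpp.

```python
import re

# Three lines copied from the perplexity_code.txt log above.
SAMPLE = """\
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+Final estimate: PPL = 1.4241 +/- 0.00887
+llama_perf_context_print: prompt eval time =   70938.65 ms / 90112 tokens (    0.79 ms per token,  1270.28 tokens per second)
"""


def summarize_perplexity(log_text):
    """Extract the final PPL and derive token count and throughput (hypothetical helper)."""
    chunks = re.search(r"calculating perplexity over (\d+) chunks, n_ctx=(\d+)", log_text)
    final = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", log_text)
    timing = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", log_text)
    if not (chunks and final and timing):
        raise ValueError("log does not contain the expected perplexity lines")

    n_chunks, n_ctx = int(chunks.group(1)), int(chunks.group(2))
    eval_ms, n_tokens = float(timing.group(1)), int(timing.group(2))
    return {
        "ppl": float(final.group(1)),
        "ppl_err": float(final.group(2)),
        "expected_tokens": n_chunks * n_ctx,       # chunks x context length
        "logged_tokens": n_tokens,
        "tokens_per_s": n_tokens / (eval_ms / 1000.0),
    }


summary = summarize_perplexity(SAMPLE)
print(summary)
```

For this run, 44 chunks at `n_ctx=2048` give 90112 tokens, which matches the token count in the `prompt eval time` line.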

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt ADDED

	@@ -0,0 +1,175 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21576 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:    1 tensors
+llama_model_loader: - type mxfp4:   37 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.21 GiB (7.56 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3641.67 MiB
+load_tensors:        CUDA0 model buffer size =  1870.32 MiB
+load_tensors:        CUDA1 model buffer size =  1870.32 MiB
+..............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   620.05 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 52.011 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.92 seconds per pass - ETA 0.47 minutes
+[1]7.4708,[2]9.1204,[3]9.5980,[4]9.1935,[5]8.8907,[6]7.5468,[7]6.8274,[8]6.8740,[9]7.2106,[10]7.3540,[11]7.3794,[12]7.7009,[13]7.7613,[14]7.9243,[15]7.9963,
+Final estimate: PPL = 7.9963 +/- 0.16811
+llama_perf_context_print:        load time =    1433.39 ms
+llama_perf_context_print: prompt eval time =   24880.66 ms / 30720 tokens (    0.81 ms per token,  1234.69 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   25323.26 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18764 + (2570 =  1870 +      80 +     620) +        2772 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21434 + (2048 =  1870 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  3781 =  3641 +     128 +      12                |
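
The KV buffer sizes reported in the log above follow from the model dimensions it prints: 2048 cells, 36 layers, `n_embd_k_gqa = n_embd_v_gqa = 1024`, and f16 (2 bytes per element). The sketch below reproduces the arithmetic. The per-device split (10 layers on each GPU, 16 on the CPU) is an assumption inferred from "offloaded 20/37 layers" and the two equal CUDA buffers; the log does not state it directly. The function name `kv_mib` is hypothetical.

```python
MIB = 1024 ** 2


def kv_mib(n_ctx, n_layer, n_embd_gqa, bytes_per_elem=2):
    """Size in MiB of one cache tensor set (K or V) across n_layer layers."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / MIB


# Values taken from the log: n_ctx=2048, n_layer=36, n_embd_k_gqa=n_embd_v_gqa=1024, f16.
k_total = kv_mib(2048, 36, 1024)     # K cache over all layers
v_total = kv_mib(2048, 36, 1024)     # V cache over all layers
total = k_total + v_total

# Assumed layer placement: 10 layers per GPU, remaining 16 on the CPU.
per_gpu = 2 * kv_mib(2048, 10, 1024)  # K + V for 10 layers
on_cpu = 2 * kv_mib(2048, 16, 1024)   # K + V for 16 layers

print(k_total, v_total, total, per_gpu, on_cpu)
```

Under that assumed placement the results agree with the logged values: 144 MiB each for K and V, 288 MiB in total, 80 MiB per CUDA device, and 128 MiB on the CPU.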

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt ADDED

	@@ -0,0 +1,175 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21434 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Instruct-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Instruct Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Instruct-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Instruct
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:    1 tensors
+llama_model_loader: - type mxfp4:   37 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.21 GiB (7.56 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Instruct Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3641.67 MiB
+load_tensors:        CUDA0 model buffer size =  1870.32 MiB
+load_tensors:        CUDA1 model buffer size =  1870.32 MiB
+..............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   620.05 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 44.25 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.83 seconds per pass - ETA 0.48 minutes
+[1]5.3482,[2]5.9293,[3]6.1706,[4]6.2572,[5]6.4530,[6]6.4396,[7]6.4091,[8]6.3640,[9]6.3929,[10]6.3972,[11]6.4098,[12]6.3948,[13]6.4462,[14]6.4485,[15]6.4346,[16]6.4264,
+Final estimate: PPL = 6.4264 +/- 0.11983
+llama_perf_context_print:        load time =    1343.31 ms
+llama_perf_context_print: prompt eval time =   25663.03 ms / 32768 tokens (    0.78 ms per token,  1276.86 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26100.77 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18764 + (2570 =  1870 +      80 +     620) +        2772 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21434 + (2048 =  1870 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  3781 =  3641 +     128 +      12                |
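
Several figures in the log above can be cross-checked against each other. The sketch below is not part of the commit. It copies its inputs from the log and applies formulas that are assumptions about how llama.cpp derives the reported values: f16 KV entries at 2 bytes each, KV memory placed with the layer that owns it, and an even split of offloaded layers across the two GPUs. The helper name `kv_cache_mib` is hypothetical.

```python
# Cross-check of figures reported in the perplexity log.
# Inputs are copied from the log; the formulas are assumptions, not llama.cpp code.

MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_gqa: int, bytes_per_elem: int = 2) -> float:
    """Size in MiB of one KV tensor set (K or V), assuming f16 (2-byte) entries."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / MIB


# n_ctx = 2048, n_layer = 36, n_embd_k_gqa = n_embd_v_gqa = 1024
k_mib = kv_cache_mib(2048, 36, 1024)   # log reports K (f16): 144.00 MiB
v_mib = kv_cache_mib(2048, 36, 1024)   # log reports V (f16): 144.00 MiB
kv_total_mib = k_mib + v_mib           # log reports size = 288.00 MiB

# 20 of the 36 repeating layers are offloaded; assume 10 per GPU.
per_layer_mib = kv_total_mib / 36
gpu_kv_mib = per_layer_mib * 10        # log reports 80.00 MiB per CUDA device
cpu_kv_mib = per_layer_mib * (36 - 20) # log reports 128.00 MiB on CPU

# 16 chunks of n_ctx = 2048 tokens, evaluated in 25663.03 ms
tokens = 16 * 2048
tokens_per_second = tokens / (25663.03 / 1000.0)
ms_per_token = 25663.03 / tokens

# Bits per weight from the rounded file size (7.21 GiB) and parameter count (8.19 B).
# Both inputs are rounded in the log, so this only approximates the reported 7.56 BPW.
bpw = 7.21 * GIB * 8 / 8.19e9

print(f"KV cache: K={k_mib:.2f} MiB, V={v_mib:.2f} MiB, total={kv_total_mib:.2f} MiB")
print(f"KV split: per GPU={gpu_kv_mib:.2f} MiB, CPU={cpu_kv_mib:.2f} MiB")
print(f"Prompt eval: {tokens} tokens, {tokens_per_second:.2f} tok/s, {ms_per_token:.2f} ms/token")
print(f"Approximate bits per weight: {bpw:.2f}")
```

Under these assumptions the computed KV sizes and throughput agree with the values printed by `llama_kv_cache` and `llama_perf_context_print`, which suggests the log is internally consistent.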

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render.