Can't run model with vLLM

#1
by attashe - opened

Commands (tested on a clean RunPod container, runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404):

pip install "vllm>=0.12.0"
pip install --upgrade git+https://github.com/huggingface/transformers.git
vllm serve cyankiwi/GLM-4.6V-Flash-AWQ-8bit --port 7860 

Error:

(APIServer pid=2095) INFO 12-09 07:26:56 [api_server.py:1772] vLLM API server version 0.12.0
(APIServer pid=2095) INFO 12-09 07:26:56 [utils.py:253] non-default args: {'model_tag': 'cyankiwi/GLM-4.6V-Flash-AWQ-8bit', 'port': 7860, 'model': 'cyankiwi/GLM-4.6V-Flash-AWQ-8bit'}
(APIServer pid=2095) [... repeated HTTP HEAD/GET requests to huggingface.co omitted; config.json and preprocessor_config.json downloaded ...]
(APIServer pid=2095) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'partial_rotary_factor'}
(APIServer pid=2095) INFO 12-09 07:27:04 [model.py:637] Resolved architecture: Glm4vForConditionalGeneration
(APIServer pid=2095) INFO 12-09 07:27:04 [model.py:1750] Using max model len 131072
(APIServer pid=2095) INFO 12-09 07:27:06 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=2095) [... repeated HTTP requests omitted; tokenizer_config.json, tokenizer.json, special_tokens_map.json, chat_template.jinja, and generation_config.json downloaded ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:19 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='cyankiwi/GLM-4.6V-Flash-AWQ-8bit', speculative_config=None, tokenizer='cyankiwi/GLM-4.6V-Flash-AWQ-8bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=cyankiwi/GLM-4.6V-Flash-AWQ-8bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2470) [... repeated HTTP requests omitted ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:22 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.6.2:44323 backend=nccl
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:22 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2470) [... repeated HTTP requests omitted ...]
(EngineCore_DP0 pid=2470) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=2470) [... repeated HTTP requests omitted; video_preprocessor_config.json downloaded ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:49 [gpu_model_runner.py:3467] Starting to load model cyankiwi/GLM-4.6V-Flash-AWQ-8bit...
(EngineCore_DP0 pid=2470) WARNING 12-09 07:27:49 [compressed_tensors.py:721] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:49 [compressed_tensors_wNa16.py:114] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=2470) INFO 12-09 07:28:10 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=2470) [... HTTP requests resolving the three model safetensors shards omitted ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:29:13 [weight_utils.py:487] Time spent downloading weights for cyankiwi/GLM-4.6V-Flash-AWQ-8bit: 62.737901 seconds
(EngineCore_DP0 pid=2470) [... HTTP requests omitted; model.safetensors.index.json downloaded ...]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     super().__init__(
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self._init_executor()
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.driver_worker.load_model()
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3484, in load_model
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 1825, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     yield from self._load_module(
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 856, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     param = params_dict[name]
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]             ~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843] KeyError: 'blocks.0.mlp.gate_up_proj.weight'
(EngineCore_DP0 pid=2470) Process EngineCore_DP0:
(EngineCore_DP0 pid=2470) [... the same traceback is printed a second time, again ending in KeyError: 'blocks.0.mlp.gate_up_proj.weight' ...]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:03<?, ?it/s]
(EngineCore_DP0 pid=2470) 
[rank0]:[W1209 07:29:18.942663927 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2095) Traceback (most recent call last):
(APIServer pid=2095)   File "/usr/local/bin/vllm", line 7, in <module>
(APIServer pid=2095)     sys.exit(main())
(APIServer pid=2095)              ^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=2095)     args.dispatch_function(args)
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=2095)     uvloop.run(run_server(args))
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2095)     return __asyncio.run(
(APIServer pid=2095)            ^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=2095)     return runner.run(main)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2095)     return self._loop.run_until_complete(task)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2095)     return await main
(APIServer pid=2095)            ^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1819, in run_server
(APIServer pid=2095)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1838, in run_server_worker
(APIServer pid=2095)     async with build_async_engine_client(
(APIServer pid=2095)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2095)     return await anext(self.gen)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 183, in build_async_engine_client
(APIServer pid=2095)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2095)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2095)     return await anext(self.gen)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 224, in build_async_engine_client_from_engine_args
(APIServer pid=2095)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2095)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=2095)     return cls(
(APIServer pid=2095)            ^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=2095)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2095)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=2095)     return AsyncMPClient(*client_args)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 810, in __init__
(APIServer pid=2095)     super().__init__(
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 471, in __init__
(APIServer pid=2095)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=2095)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=2095)     next(self.gen)
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
(APIServer pid=2095)     wait_for_engine_startup(
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
(APIServer pid=2095)     raise RuntimeError(
(APIServer pid=2095) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
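
For anyone debugging a similar KeyError during weight loading: a quick way to see which tensor names a checkpoint actually contains is to open one shard with the safetensors library and compare against the name in the KeyError. A minimal sketch (the shard must already be downloaded locally; the path below is illustrative):

    # Print the first few tensor names stored in a downloaded shard.
    from safetensors import safe_open

    with safe_open("model-00001-of-00003.safetensors", framework="pt") as f:
        for name in list(f.keys())[:20]:
            print(name)

If the names in the shard don't line up with what vLLM's Glm4vForConditionalGeneration loader expects, the checkpoint layout and the config disagree.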
cyankiwi org

Thank you for letting me know. Please redownload the updated config.json and replace the old one with it.
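
If it helps anyone else hitting this: a minimal sketch of forcing a fresh copy with huggingface_hub, bypassing the stale cached config.json (assumes huggingface_hub is installed):

    # Force a re-download of config.json into the local HF cache.
    from huggingface_hub import hf_hub_download

    config_path = hf_hub_download(
        repo_id="cyankiwi/GLM-4.6V-Flash-AWQ-8bit",
        filename="config.json",
        force_download=True,  # re-fetch even if a cached copy exists
    )
    print(config_path)

Alternatively, deleting the model's folder under ~/.cache/huggingface/hub and re-running vllm serve re-downloads everything from scratch.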

Thanks, it works now

cpatonn changed discussion status to closed
