Can't run model with vLLM

#1
by attashe - opened

Commands (tested on a clean RunPod container, runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404):

pip install "vllm>=0.12.0"
pip install --upgrade git+https://github.com/huggingface/transformers.git
vllm serve cyankiwi/GLM-4.6V-Flash-AWQ-8bit --port 7860 

Error:

(APIServer pid=2095) INFO 12-09 07:26:56 [api_server.py:1772] vLLM API server version 0.12.0
(APIServer pid=2095) INFO 12-09 07:26:56 [utils.py:253] non-default args: {'model_tag': 'cyankiwi/GLM-4.6V-Flash-AWQ-8bit', 'port': 7860, 'model': 'cyankiwi/GLM-4.6V-Flash-AWQ-8bit'}
(APIServer pid=2095) [... repeated HTTP HEAD/GET requests to huggingface.co omitted; config.json and preprocessor_config.json downloaded ...]
(APIServer pid=2095) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'partial_rotary_factor'}
(APIServer pid=2095) INFO 12-09 07:27:04 [model.py:637] Resolved architecture: Glm4vForConditionalGeneration
(APIServer pid=2095) INFO 12-09 07:27:04 [model.py:1750] Using max model len 131072
(APIServer pid=2095) INFO 12-09 07:27:06 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=2095) [... repeated HTTP requests omitted; tokenizer_config.json, tokenizer.json, special_tokens_map.json, chat_template.jinja, and generation_config.json downloaded ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:19 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='cyankiwi/GLM-4.6V-Flash-AWQ-8bit', speculative_config=None, tokenizer='cyankiwi/GLM-4.6V-Flash-AWQ-8bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=cyankiwi/GLM-4.6V-Flash-AWQ-8bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2470) [... repeated HTTP requests omitted ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:22 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.6.2:44323 backend=nccl
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:22 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2470) [... repeated HTTP requests omitted ...]
(EngineCore_DP0 pid=2470) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=2470) [... repeated HTTP requests omitted; video_preprocessor_config.json downloaded ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:49 [gpu_model_runner.py:3467] Starting to load model cyankiwi/GLM-4.6V-Flash-AWQ-8bit...
(EngineCore_DP0 pid=2470) WARNING 12-09 07:27:49 [compressed_tensors.py:721] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=2470) INFO 12-09 07:27:49 [compressed_tensors_wNa16.py:114] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=2470) INFO 12-09 07:28:10 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=2470) [... HTTP requests resolving the three model safetensors shards omitted ...]
(EngineCore_DP0 pid=2470) INFO 12-09 07:29:13 [weight_utils.py:487] Time spent downloading weights for cyankiwi/GLM-4.6V-Flash-AWQ-8bit: 62.737901 seconds
(EngineCore_DP0 pid=2470) [... HTTP requests omitted; model.safetensors.index.json downloaded ...]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     super().__init__(
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self._init_executor()
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.driver_worker.load_model()
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3484, in load_model
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 1825, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     yield from self._load_module(
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 856, in load_weights
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]     param = params_dict[name]
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843]             ~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=2470) ERROR 12-09 07:29:17 [core.py:843] KeyError: 'blocks.0.mlp.gate_up_proj.weight'
(EngineCore_DP0 pid=2470) Process EngineCore_DP0:
(EngineCore_DP0 pid=2470) [... the same traceback is printed a second time, again ending in KeyError: 'blocks.0.mlp.gate_up_proj.weight' ...]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:03<?, ?it/s]
(EngineCore_DP0 pid=2470) 
[rank0]:[W1209 07:29:18.942663927 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2095) Traceback (most recent call last):
(APIServer pid=2095)   File "/usr/local/bin/vllm", line 7, in <module>
(APIServer pid=2095)     sys.exit(main())
(APIServer pid=2095)              ^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=2095)     args.dispatch_function(args)
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=2095)     uvloop.run(run_server(args))
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2095)     return __asyncio.run(
(APIServer pid=2095)            ^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=2095)     return runner.run(main)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2095)     return self._loop.run_until_complete(task)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2095)     return await main
(APIServer pid=2095)            ^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1819, in run_server
(APIServer pid=2095)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1838, in run_server_worker
(APIServer pid=2095)     async with build_async_engine_client(
(APIServer pid=2095)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2095)     return await anext(self.gen)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 183, in build_async_engine_client
(APIServer pid=2095)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2095)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2095)     return await anext(self.gen)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 224, in build_async_engine_client_from_engine_args
(APIServer pid=2095)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2095)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=2095)     return cls(
(APIServer pid=2095)            ^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=2095)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2095)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=2095)     return AsyncMPClient(*client_args)
(APIServer pid=2095)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 810, in __init__
(APIServer pid=2095)     super().__init__(
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 471, in __init__
(APIServer pid=2095)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=2095)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=2095)     next(self.gen)
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
(APIServer pid=2095)     wait_for_engine_startup(
(APIServer pid=2095)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
(APIServer pid=2095)     raise RuntimeError(
(APIServer pid=2095) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
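
For anyone debugging a similar KeyError during weight loading: a quick way to see which tensor names a checkpoint actually contains is to open one shard with the safetensors library and compare against the name in the KeyError. A minimal sketch (the shard must already be downloaded locally; the path below is illustrative):

    # Print the first few tensor names stored in a downloaded shard.
    from safetensors import safe_open

    with safe_open("model-00001-of-00003.safetensors", framework="pt") as f:
        for name in list(f.keys())[:20]:
            print(name)

If the names in the shard don't line up with what vLLM's Glm4vForConditionalGeneration loader expects, the checkpoint layout and the config disagree.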
cyankiwi org

Thank you for letting me know. Please redownload the updated config.json and replace the old one with it.
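
If it helps anyone else hitting this: a minimal sketch of forcing a fresh copy with huggingface_hub, bypassing the stale cached config.json (assumes huggingface_hub is installed):

    # Force a re-download of config.json into the local HF cache.
    from huggingface_hub import hf_hub_download

    config_path = hf_hub_download(
        repo_id="cyankiwi/GLM-4.6V-Flash-AWQ-8bit",
        filename="config.json",
        force_download=True,  # re-fetch even if a cached copy exists
    )
    print(config_path)

Alternatively, deleting the model's folder under ~/.cache/huggingface/hub and re-running vllm serve re-downloads everything from scratch.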

Thanks, it works now

cpatonn changed discussion status to closed
