vLLM is not working. Engine startup fails with the following error:
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] File "/mnt/data/minconda/envs/hunyuan/lib/python3.12/site-packages/vllm/model_executor/models/transformers/multimodal.py", line 68, in get_max_image_tokens
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] mm_tokens = processor._get_num_multimodal_tokens(
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'
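For anyone trying to narrow this down: the traceback shows vLLM's Transformers fallback (`models/transformers/multimodal.py`) calling `_get_num_multimodal_tokens` on the processor, and `Step3VLProcessor` not having it. Below is a minimal sketch for checking a processor object before starting the engine. The helper name `missing_processor_hooks` and the two stand-in classes are made up for illustration; only the attribute name comes from the traceback.

```python
def missing_processor_hooks(processor, required=("_get_num_multimodal_tokens",)):
    """Return the names in `required` that `processor` does not expose as callables."""
    return [name for name in required if not callable(getattr(processor, name, None))]


# Stand-in classes for illustration only (hypothetical, not real processors).
class ProcessorWithHook:
    def _get_num_multimodal_tokens(self, *args, **kwargs):
        return {}


class ProcessorWithoutHook:
    pass


print(missing_processor_hooks(ProcessorWithHook()))     # []
print(missing_processor_hooks(ProcessorWithoutHook()))  # ['_get_num_multimodal_tokens']

# With a real checkpoint, the processor could be loaded like this
# (assumes `transformers` is installed and `model_path` points at the model):
#   from transformers import AutoProcessor
#   processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
#   print(missing_processor_hooks(processor))
```

If the list is non-empty for the real processor, the failure above is expected whenever vLLM falls back to the Transformers implementation for this model.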
Could you please provide the vllm serve command?
INFO 01-17 11:48:27 [utils.py:253] non-default args: {'trust_remote_code': True, 'dtype': 'bfloat16', 'seed': 42, 'max_model_len': 32768, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.96, 'max_num_seqs': 8, 'disable_log_stats': True, 'model': '/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 01-17 11:48:33 [model.py:631] Resolved architecture: TransformersMultiModalForCausalLM
INFO 01-17 11:48:33 [model.py:1745] Using max model len 32768
INFO 01-17 11:48:33 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 01-17 11:48:33 [utils.py:177] TransformersMultiModalForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu128 Please see https://github.com/pytorch/ao/issues/2919 for more info
(EngineCore_DP0 pid=127656) INFO 01-17 11:48:38 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B', speculative_config=None, tokenizer='/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=42, served_model_name=/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': 
<CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 16, 'local_cache_dir': None}
(EngineCore_DP0 pid=127656) WARNING 01-17 11:48:38 [multiproc_executor.py:869] Reducing Torch parallelism from 10 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu128 Please see https://github.com/pytorch/ao/issues/2919 for more info
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu128 Please see https://github.com/pytorch/ao/issues/2919 for more info
WARNING 01-17 11:48:42 [utils.py:177] TransformersMultiModalForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
WARNING 01-17 11:48:42 [utils.py:177] TransformersMultiModalForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
/usr/local/lib/python3.12/dist-packages/torch/library.py:364: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: CUDA
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU_3.cpp:3798
new kernel: registered at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/batch_invariant.py:734 (Triggered internally at /pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:218.)
self.m.impl(
/usr/local/lib/python3.12/dist-packages/torch/library.py:364: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: CUDA
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU_3.cpp:3798
new kernel: registered at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/batch_invariant.py:734 (Triggered internally at /pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:218.)
self.m.impl(
[W117 11:48:42.595966717 Context.cpp:580] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W117 11:48:42.596018744 Context.cpp:580] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
/usr/local/lib/python3.12/dist-packages/torch/backends/__init__.py:46: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
self.setter(val)
/usr/local/lib/python3.12/dist-packages/torch/backends/__init__.py:46: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
self.setter(val)
INFO 01-17 11:48:42 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:50205 backend=nccl
INFO 01-17 11:48:42 [parallel_state.py:1208] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:50205 backend=nccl
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 01-17 11:48:43 [pynccl.py:111] vLLM is using nccl==2.27.5
[Gloo] Rank 0 is connected to 0 peer ranks. [Gloo] Rank Expected number of connected peer ranks is : 00
is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank
0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank [Gloo] Rank 01 is connected to 1 is connected to peer ranks. 1Expected number of connected peer ranks is : peer ranks. 1Expected number of connected peer ranks is : 1
INFO 01-17 11:48:43 [parallel_state.py:1394] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 01-17 11:48:43 [parallel_state.py:1394] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
ERROR 01-17 11:48:46 [multiproc_executor.py:743] WorkerProc failed to start.
ERROR 01-17 11:48:46 [multiproc_executor.py:743] Traceback (most recent call last):
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 715, in worker_main
ERROR 01-17 11:48:46 [multiproc_executor.py:743] worker = WorkerProc(*args, **kwargs)
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 546, in __init__
ERROR 01-17 11:48:46 [multiproc_executor.py:743] self.worker.init_device()
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 324, in init_device
ERROR 01-17 11:48:46 [multiproc_executor.py:743] self.worker.init_device() # type: ignore
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 260, in init_device
ERROR 01-17 11:48:46 [multiproc_executor.py:743] self.model_runner: GPUModelRunner = GPUModelRunner(
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 533, in __init__
ERROR 01-17 11:48:46 [multiproc_executor.py:743] MultiModalBudget(
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 45, in __init__
ERROR 01-17 11:48:46 [multiproc_executor.py:743] max_tokens_by_modality = mm_registry.get_max_tokens_per_item_by_modality(
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 172, in get_max_tokens_per_item_by_modality
ERROR 01-17 11:48:46 [multiproc_executor.py:743] return profiler.get_mm_max_contiguous_tokens(
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 369, in get_mm_max_contiguous_tokens
ERROR 01-17 11:48:46 [multiproc_executor.py:743] return self._get_mm_max_tokens(seq_len, mm_counts, mm_embeddings_only=False)
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 340, in _get_mm_max_tokens
ERROR 01-17 11:48:46 [multiproc_executor.py:743] max_tokens_per_item = self.processing_info.get_mm_max_tokens_per_item(
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/multimodal.py", line 61, in get_mm_max_tokens_per_item
ERROR 01-17 11:48:46 [multiproc_executor.py:743] return {"image": self.get_max_image_tokens()}
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/multimodal.py", line 68, in get_max_image_tokens
ERROR 01-17 11:48:46 [multiproc_executor.py:743] mm_tokens = processor._get_num_multimodal_tokens(
ERROR 01-17 11:48:46 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'
INFO 01-17 11:48:46 [multiproc_executor.py:702] Parent process exited, terminating worker
INFO 01-17 11:48:46 [multiproc_executor.py:702] Parent process exited, terminating worker
[rank0]:[W117 11:48:46.384880841 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] EngineCore failed to start.
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] Traceback (most recent call last):
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] super().__init__(
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] super().__init__(vllm_config)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] self._init_executor()
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 171, in _init_executor
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 653, in wait_for_ready
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] raise e from None
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=127656) Process EngineCore_DP0:
(EngineCore_DP0 pid=127656) Traceback (most recent call last):
(EngineCore_DP0 pid=127656) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=127656) self.run()
(EngineCore_DP0 pid=127656) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=127656) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=127656) raise e
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=127656) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=127656) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=127656) super().__init__(
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=127656) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=127656) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in __init__
(EngineCore_DP0 pid=127656) super().__init__(vllm_config)
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=127656) self._init_executor()
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 171, in _init_executor
(EngineCore_DP0 pid=127656) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=127656) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 653, in wait_for_ready
(EngineCore_DP0 pid=127656) raise e from None
(EngineCore_DP0 pid=127656) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[10], line 13
3 tokenizer = AutoTokenizer.from_pretrained(
4 model_name,
5 trust_remote_code=True,
(...) 9 device_map="auto",
10 )
11 print(f"Tokenizer loaded. Dictionary size: {tokenizer.vocab_size:,}")
---> 13 llm = LLM(
14 model_name,
15 tokenizer=tokenizer,
16 dtype="bfloat16",
17 max_num_seqs=max_num_seqs, # 256 Limiting parallel sequences for stability
18 max_model_len=max_model_len, # 32768 / 81920
19 trust_remote_code=True, # Required for Qwen
20
21 seed=seed_value, # Determinism
22 #max_num_batched_tokens=16*1024,
23 #disable_log_stats=True, # Disabling logging for performance
24 #skip_tokenizer_init=False, # Critical - Do not skip initialization
25 #enable_prefix_caching=True,
26 ###max_paddings=2048, # 256 or 2048 batch processing optimization
27 ###block_size=32, # Optimization for long context
28
29 ###tensor_parallel_size=1,
30 #enable_expert_parallel=True, # MoE
31 ###gpu_memory_utilization=0.96,
32 ###enforce_eager=True, # Disabling CUDA Graphs for predictability in offline mode
33 **llm_cfg
34 )
36 if not begin_tokenizer_load:
37 tokenizer = llm.get_tokenizer()
File /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py:343, in LLM.__init__(self, model, runner, convert, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, allowed_media_domains, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, disable_custom_all_reduce, hf_token, hf_overrides, mm_processor_kwargs, pooler_config, override_pooler_config, structured_outputs_config, kv_cache_memory_bytes, compilation_config, logits_processors, **kwargs)
340 log_non_default_args(engine_args)
342 # Create the Engine (autoselects V0 vs V1)
--> 343 self.llm_engine = LLMEngine.from_engine_args(
344 engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
345 )
346 self.engine_class = type(self.llm_engine)
348 self.request_counter = Counter()
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:174, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers, enable_multiprocessing)
171 enable_multiprocessing = True
173 # Create the LLMEngine.
--> 174 return cls(
175 vllm_config=vllm_config,
176 executor_class=executor_class,
177 log_stats=not engine_args.disable_log_stats,
178 usage_context=usage_context,
179 stat_loggers=stat_loggers,
180 multiprocess_mode=enable_multiprocessing,
181 )
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:108, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, aggregate_engine_logging, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
105 self.output_processor.tracer = tracer
107 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 108 self.engine_core = EngineCoreClient.make_client(
109 multiprocess_mode=multiprocess_mode,
110 asyncio_mode=False,
111 vllm_config=vllm_config,
112 executor_class=executor_class,
113 log_stats=self.log_stats,
114 )
116 self.logger_manager: StatLoggerManager | None = None
117 if self.log_stats:
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:93, in EngineCoreClient.make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
88 return EngineCoreClient.make_async_mp_client(
89 vllm_config, executor_class, log_stats
90 )
92 if multiprocess_mode and not asyncio_mode:
---> 93 return SyncMPClient(vllm_config, executor_class, log_stats)
95 return InprocClient(vllm_config, executor_class, log_stats)
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:640, in SyncMPClient.__init__(self, vllm_config, executor_class, log_stats)
637 def __init__(
638 self, vllm_config: VllmConfig, executor_class: type[Executor], log_stats: bool
639 ):
--> 640 super().__init__(
641 asyncio_mode=False,
642 vllm_config=vllm_config,
643 executor_class=executor_class,
644 log_stats=log_stats,
645 )
647 self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1
648 self.outputs_queue = queue.Queue[EngineCoreOutputs | Exception]()
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:469, in MPClient.__init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
466 self.stats_update_address = client_addresses.get("stats_update_address")
467 else:
468 # Engines are managed by this client.
--> 469 with launch_core_engines(vllm_config, executor_class, log_stats) as (
470 engine_manager,
471 coordinator,
472 addresses,
473 ):
474 self.resources.coordinator = coordinator
475 self.resources.engine_manager = engine_manager
File /usr/lib/python3.12/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
142 if typ is None:
143 try:
--> 144 next(self.gen)
145 except StopIteration:
146 return False
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:907, in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
904 yield local_engine_manager, coordinator, addresses
906 # Now wait for engines to start.
--> 907 wait_for_engine_startup(
908 handshake_socket,
909 addresses,
910 engines_to_handshake,
911 parallel_config,
912 vllm_config.cache_config,
913 local_engine_manager,
914 coordinator.proc if coordinator else None,
915 )
File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:964, in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, cache_config, proc_manager, coord_process)
962 if coord_process is not None and coord_process.exitcode is not None:
963 finished[coord_process.name] = coord_process.exitcode
--> 964 raise RuntimeError(
965 "Engine core initialization failed. "
966 "See root cause above. "
967 f"Failed core proc(s): {finished}"
968 )
970 # Receive HELLO and READY messages from the input socket.
971 eng_identity, ready_msg_bytes = handshake_socket.recv_multipart()
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
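The final `RuntimeError` only says "See root cause above", and the real exception is buried in the worker log. A rough sketch for pulling it out of a captured log follows. The function name `find_root_cause`, the regex, and the list of wrapper messages are assumptions based on the log lines in this thread, not anything vLLM provides.

```python
import re

# Matches an exception line at the end of a vLLM log record, i.e. right after
# the "[file.py:NNN]" tag, for example "AttributeError: ... has no attribute ...".
_EXC_RE = re.compile(r"\]\s+((?:\w+\.)*\w*(?:Error|Exception)): (.+)$")

# Messages that only point back at the real failure (taken from the log above).
_WRAPPERS = (
    "WorkerProc initialization failed",
    "Engine core initialization failed",
)


def find_root_cause(log_text):
    """Return the first exception line that is not a known wrapper, or None."""
    for line in log_text.splitlines():
        match = _EXC_RE.search(line)
        if match and not match.group(2).startswith(_WRAPPERS):
            return f"{match.group(1)}: {match.group(2)}"
    return None


sample = "\n".join([
    "ERROR 01-17 11:48:46 [multiproc_executor.py:743] WorkerProc failed to start.",
    "ERROR 01-17 11:48:46 [multiproc_executor.py:743] AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'",
    "(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.",
])
print(find_root_cause(sample))
```

On the log in this thread it reports the `AttributeError` from `Step3VLProcessor`, which is the line worth quoting in a bug report.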
Same problem.
vllm command: vllm serve home/models/pretrain_weights/Step3-VL-10B -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.85 --trust-remote-code
Environment:
Python 3.11
vLLM 0.11.0
Same problem.
vllm command:
vllm serve /model --served-model-name step3-vl:10b --host 0.0.0.0 --gpu-memory-utilization 0.6 -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code
Environment:
Docker image vllm/vllm-openai:v0.14.0