vLLM is not working with Step3-VL-10B

#4 · opened by wei01

(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] File "/mnt/data/minconda/envs/hunyuan/lib/python3.12/site-packages/vllm/model_executor/models/transformers/multimodal.py", line 68, in get_max_image_tokens
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] mm_tokens = processor._get_num_multimodal_tokens(
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3160168) ERROR 01-16 14:39:15 [core.py:842] AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'
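For reference, the failing call is vLLM's Transformers multimodal fallback asking the Hugging Face processor for _get_num_multimodal_tokens. A minimal diagnostic sketch (not from the original report; the model id is an assumption, point it at your local checkpoint) to reproduce the check outside vLLM:

# Illustrative diagnostic: verify whether the processor shipped with your
# transformers install exposes the hook that vLLM's Transformers fallback calls.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "stepfun-ai/Step3-VL-10B",  # assumption: replace with your local model path
    trust_remote_code=True,
)
print(type(processor).__name__)                          # e.g. Step3VLProcessor
print(hasattr(processor, "_get_num_multimodal_tokens"))  # False reproduces the AttributeError above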

StepFun org

Could you please provide the vllm serve command?

INFO 01-17 11:48:27 [utils.py:253] non-default args: {'trust_remote_code': True, 'dtype': 'bfloat16', 'seed': 42, 'max_model_len': 32768, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.96, 'max_num_seqs': 8, 'disable_log_stats': True, 'model': '/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 01-17 11:48:33 [model.py:631] Resolved architecture: TransformersMultiModalForCausalLM
INFO 01-17 11:48:33 [model.py:1745] Using max model len 32768
INFO 01-17 11:48:33 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 01-17 11:48:33 [utils.py:177] TransformersMultiModalForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu128             Please see https://github.com/pytorch/ao/issues/2919 for more info
(EngineCore_DP0 pid=127656) INFO 01-17 11:48:38 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B', speculative_config=None, tokenizer='/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=42, served_model_name=/kaggle/models/text-to-text/stepfun-ai/Step3-VL-10B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 16, 'local_cache_dir': None}
(EngineCore_DP0 pid=127656) WARNING 01-17 11:48:38 [multiproc_executor.py:869] Reducing Torch parallelism from 10 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu128             Please see https://github.com/pytorch/ao/issues/2919 for more info
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu128             Please see https://github.com/pytorch/ao/issues/2919 for more info
WARNING 01-17 11:48:42 [utils.py:177] TransformersMultiModalForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
WARNING 01-17 11:48:42 [utils.py:177] TransformersMultiModalForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
/usr/local/lib/python3.12/dist-packages/torch/library.py:364: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: CUDA
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU_3.cpp:3798
       new kernel: registered at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/batch_invariant.py:734 (Triggered internally at /pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:218.)
  self.m.impl(
/usr/local/lib/python3.12/dist-packages/torch/library.py:364: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: CUDA
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU_3.cpp:3798
       new kernel: registered at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/batch_invariant.py:734 (Triggered internally at /pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:218.)
  self.m.impl(
[W117 11:48:42.595966717 Context.cpp:580] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W117 11:48:42.596018744 Context.cpp:580] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
/usr/local/lib/python3.12/dist-packages/torch/backends/__init__.py:46: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  self.setter(val)
/usr/local/lib/python3.12/dist-packages/torch/backends/__init__.py:46: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  self.setter(val)
INFO 01-17 11:48:42 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:50205 backend=nccl
INFO 01-17 11:48:42 [parallel_state.py:1208] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:50205 backend=nccl
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 01-17 11:48:43 [pynccl.py:111] vLLM is using nccl==2.27.5
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 01-17 11:48:43 [parallel_state.py:1394] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 01-17 11:48:43 [parallel_state.py:1394] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
ERROR 01-17 11:48:46 [multiproc_executor.py:743] WorkerProc failed to start.
ERROR 01-17 11:48:46 [multiproc_executor.py:743] Traceback (most recent call last):
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 715, in worker_main
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     worker = WorkerProc(*args, **kwargs)
ERROR 01-17 11:48:46 [multiproc_executor.py:743]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 546, in __init__
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     self.worker.init_device()
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 324, in init_device
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     self.worker.init_device()  # type: ignore
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 260, in init_device
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     self.model_runner: GPUModelRunner = GPUModelRunner(
ERROR 01-17 11:48:46 [multiproc_executor.py:743]                                         ^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 533, in __init__
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     MultiModalBudget(
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 45, in __init__
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     max_tokens_by_modality = mm_registry.get_max_tokens_per_item_by_modality(
ERROR 01-17 11:48:46 [multiproc_executor.py:743]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 172, in get_max_tokens_per_item_by_modality
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     return profiler.get_mm_max_contiguous_tokens(
ERROR 01-17 11:48:46 [multiproc_executor.py:743]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 369, in get_mm_max_contiguous_tokens
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     return self._get_mm_max_tokens(seq_len, mm_counts, mm_embeddings_only=False)
ERROR 01-17 11:48:46 [multiproc_executor.py:743]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 340, in _get_mm_max_tokens
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     max_tokens_per_item = self.processing_info.get_mm_max_tokens_per_item(
ERROR 01-17 11:48:46 [multiproc_executor.py:743]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/multimodal.py", line 61, in get_mm_max_tokens_per_item
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     return {"image": self.get_max_image_tokens()}
ERROR 01-17 11:48:46 [multiproc_executor.py:743]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/multimodal.py", line 68, in get_max_image_tokens
ERROR 01-17 11:48:46 [multiproc_executor.py:743]     mm_tokens = processor._get_num_multimodal_tokens(
ERROR 01-17 11:48:46 [multiproc_executor.py:743]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 11:48:46 [multiproc_executor.py:743] AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'
INFO 01-17 11:48:46 [multiproc_executor.py:702] Parent process exited, terminating worker
INFO 01-17 11:48:46 [multiproc_executor.py:702] Parent process exited, terminating worker
[rank0]:[W117 11:48:46.384880841 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] EngineCore failed to start.
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] Traceback (most recent call last):
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     super().__init__(
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     super().__init__(vllm_config)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     self._init_executor()
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 171, in _init_executor
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 653, in wait_for_ready
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842]     raise e from None
(EngineCore_DP0 pid=127656) ERROR 01-17 11:48:47 [core.py:842] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=127656) Process EngineCore_DP0:
(EngineCore_DP0 pid=127656) Traceback (most recent call last):
(EngineCore_DP0 pid=127656)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=127656)     self.run()
(EngineCore_DP0 pid=127656)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=127656)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=127656)     raise e
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=127656)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=127656)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=127656)     super().__init__(
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=127656)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=127656)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in __init__
(EngineCore_DP0 pid=127656)     super().__init__(vllm_config)
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=127656)     self._init_executor()
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 171, in _init_executor
(EngineCore_DP0 pid=127656)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=127656)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=127656)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 653, in wait_for_ready
(EngineCore_DP0 pid=127656)     raise e from None
(EngineCore_DP0 pid=127656) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[10], line 13
      3     tokenizer = AutoTokenizer.from_pretrained(
      4         model_name,
      5         trust_remote_code=True,
   (...)      9         device_map="auto",
     10     )
     11     print(f"Tokenizer loaded. Dictionary size: {tokenizer.vocab_size:,}")
---> 13 llm = LLM(
     14     model_name,
     15     tokenizer=tokenizer,
     16     dtype="bfloat16",
     17     max_num_seqs=max_num_seqs,  # 256 Limiting parallel sequences for stability
     18     max_model_len=max_model_len,  # 32768 / 81920
     19     trust_remote_code=True,  # Required for Qwen
     20 
     21     seed=seed_value,  # Determinism
     22     #max_num_batched_tokens=16*1024,
     23     #disable_log_stats=True,  # Disabling logging for performance
     24     #skip_tokenizer_init=False,  # Critical - Do not skip initialization
     25     #enable_prefix_caching=True,
     26     ###max_paddings=2048,  # 256 or 2048 batch processing optimization
     27     ###block_size=32,  # Optimization for long context
     28 
     29     ###tensor_parallel_size=1,
     30     #enable_expert_parallel=True,  # MoE
     31     ###gpu_memory_utilization=0.96,
     32     ###enforce_eager=True,  # Disabling CUDA Graphs for predictability in offline mode
     33     **llm_cfg
     34 )
     36 if not begin_tokenizer_load:
     37     tokenizer = llm.get_tokenizer()

File /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py:343, in LLM.__init__(self, model, runner, convert, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, allowed_media_domains, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, disable_custom_all_reduce, hf_token, hf_overrides, mm_processor_kwargs, pooler_config, override_pooler_config, structured_outputs_config, kv_cache_memory_bytes, compilation_config, logits_processors, **kwargs)
    340 log_non_default_args(engine_args)
    342 # Create the Engine (autoselects V0 vs V1)
--> 343 self.llm_engine = LLMEngine.from_engine_args(
    344     engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
    345 )
    346 self.engine_class = type(self.llm_engine)
    348 self.request_counter = Counter()

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:174, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers, enable_multiprocessing)
    171     enable_multiprocessing = True
    173 # Create the LLMEngine.
--> 174 return cls(
    175     vllm_config=vllm_config,
    176     executor_class=executor_class,
    177     log_stats=not engine_args.disable_log_stats,
    178     usage_context=usage_context,
    179     stat_loggers=stat_loggers,
    180     multiprocess_mode=enable_multiprocessing,
    181 )

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:108, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, aggregate_engine_logging, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
    105     self.output_processor.tracer = tracer
    107 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 108 self.engine_core = EngineCoreClient.make_client(
    109     multiprocess_mode=multiprocess_mode,
    110     asyncio_mode=False,
    111     vllm_config=vllm_config,
    112     executor_class=executor_class,
    113     log_stats=self.log_stats,
    114 )
    116 self.logger_manager: StatLoggerManager | None = None
    117 if self.log_stats:

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:93, in EngineCoreClient.make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
     88     return EngineCoreClient.make_async_mp_client(
     89         vllm_config, executor_class, log_stats
     90     )
     92 if multiprocess_mode and not asyncio_mode:
---> 93     return SyncMPClient(vllm_config, executor_class, log_stats)
     95 return InprocClient(vllm_config, executor_class, log_stats)

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:640, in SyncMPClient.__init__(self, vllm_config, executor_class, log_stats)
    637 def __init__(
    638     self, vllm_config: VllmConfig, executor_class: type[Executor], log_stats: bool
    639 ):
--> 640     super().__init__(
    641         asyncio_mode=False,
    642         vllm_config=vllm_config,
    643         executor_class=executor_class,
    644         log_stats=log_stats,
    645     )
    647     self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1
    648     self.outputs_queue = queue.Queue[EngineCoreOutputs | Exception]()

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:469, in MPClient.__init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
    466     self.stats_update_address = client_addresses.get("stats_update_address")
    467 else:
    468     # Engines are managed by this client.
--> 469     with launch_core_engines(vllm_config, executor_class, log_stats) as (
    470         engine_manager,
    471         coordinator,
    472         addresses,
    473     ):
    474         self.resources.coordinator = coordinator
    475         self.resources.engine_manager = engine_manager

File /usr/lib/python3.12/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    142 if typ is None:
    143     try:
--> 144         next(self.gen)
    145     except StopIteration:
    146         return False

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:907, in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
    904 yield local_engine_manager, coordinator, addresses
    906 # Now wait for engines to start.
--> 907 wait_for_engine_startup(
    908     handshake_socket,
    909     addresses,
    910     engines_to_handshake,
    911     parallel_config,
    912     vllm_config.cache_config,
    913     local_engine_manager,
    914     coordinator.proc if coordinator else None,
    915 )

File /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:964, in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, cache_config, proc_manager, coord_process)
    962     if coord_process is not None and coord_process.exitcode is not None:
    963         finished[coord_process.name] = coord_process.exitcode
--> 964     raise RuntimeError(
    965         "Engine core initialization failed. "
    966         "See root cause above. "
    967         f"Failed core proc(s): {finished}"
    968     )
    970 # Receive HELLO and READY messages from the input socket.
    971 eng_identity, ready_msg_bytes = handshake_socket.recv_multipart()

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

+1

Same problem.
vllm command:
vllm serve home/models/pretrain_weights/Step3-VL-10B -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.85 --trust-remote-code
Environment:
Python 3.11
vLLM 0.11.0

Please reinstall vLLM from the nightly wheels:
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
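After reinstalling, a quick sanity check before retrying vllm serve (a sketch only, assuming vLLM's ModelRegistry.get_supported_archs() API and that a native Step3-VL implementation is what removes the Transformers fallback):

# Illustrative check: confirm the nightly build is active and whether the
# Step3-VL architecture is registered natively (i.e. no fallback needed).
import vllm
from vllm import ModelRegistry

print("vLLM:", vllm.__version__)  # should report a pre-release/nightly version
archs = ModelRegistry.get_supported_archs()
print([a for a in archs if "Step3" in a])  # non-empty suggests a native implementation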

Same problem.
vllm command:
vllm serve /model --served-model-name step3-vl:10b --host 0.0.0.0 --gpu-memory-utilization 0.6 -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code

Environment:
Docker image: vllm/vllm-openai:v0.14.0
