vLLM pitfall notes: matching the GPU compute capability
NVIDIA GPU compute capability quick reference: https://blog.csdn.net/jacke121/article/details/159576930

Installing vLLM

    pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

    # for vllm 0.11.0
    pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

    python vllm_example.py

The GPU here is an RTX 5090 (Compute Capability 12.0), whereas vLLM 0.9.0 ships kernels for SM 7.0-9.0. With no kernel built for 12.0, the run fails with "CUDA error: no kernel image is available for execution on the device", as the log at the end of this post shows. vLLM v0.11.0 supports compute capability 12.0.

For comparison, the RTX 4080 has Compute Capability 8.9, the same as the RTX 4090, the RTX 4070, and the rest of the 40 series.

Full output of running /data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py with vLLM 0.9.0 on the RTX 5090:

    INFO 03-28 11:23:44 [__init__.py:243] Automatically detected platform cuda.
    Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
    2026-03-28 11:23:52,335 INFO input frame rate=25
    /data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
      WeightNorm.apply(module, name, dim)
    /data/lbg/envs/flashtalk/lib/python3.10/site-packages/pyworld/__init__.py:13: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
      import pkg_resources
    /data/lbg/envs/flashtalk/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
      warnings.warn(
    2026-03-28 11:23:56,030 INFO no frontend is avaliable
    INFO 03-28 11:23:59 [__init__.py:31] Available plugins for group vllm.general_plugins:
    INFO 03-28 11:23:59 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
    INFO 03-28 11:23:59 [__init__.py:36] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
    INFO 03-28 11:24:06 [config.py:793] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
    WARNING 03-28 11:24:06 [arg_utils.py:1583] --enable-prompt-embeds is not supported by the V1 Engine. Falling back to V0.
    INFO 03-28 11:24:06 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0) with config: model='/data/lbg/models/CosyVoice3-0.5B/vllm', speculative_config=None, tokenizer='/data/lbg/models/CosyVoice3-0.5B/vllm', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/lbg/models/CosyVoice3-0.5B/vllm, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
    INFO 03-28 11:24:06 [cuda.py:292] Using Flash Attention backend.
    [W328 11:24:17.372136289 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
    INFO 03-28 11:24:22 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    INFO 03-28 11:24:22 [model_runner.py:1170] Starting to load model /data/lbg/models/CosyVoice3-0.5B/vllm...
    Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.01it/s]
    Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.01it/s]
    INFO 03-28 11:24:23 [default_loader.py:280] Loading weights took 0.52 seconds
    INFO 03-28 11:24:23 [model_runner.py:1202] Model loading took 0.7001 GiB and 0.599737 seconds
    [rank0]: Traceback (most recent call last):
    [rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 40, in <module>
    [rank0]:     main()
    [rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 36, in main
    [rank0]:     cosyvoice3_example()
    [rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 26, in cosyvoice3_example
    [rank0]:     cosyvoice = AutoModel(model_dir='/data/lbg/models/CosyVoice3-0.5B', load_trt=True, load_vllm=True, fp16=False)
    [rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 236, in AutoModel
    [rank0]:     return CosyVoice3(**kwargs)
    [rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 217, in __init__
    [rank0]:     self.model.load_vllm('{}/vllm'.format(model_dir))
    [rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/model.py", line 288, in load_vllm
    [rank0]:     self.llm.vllm = LLMEngine.from_engine_args(engine_args)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
    [rank0]:     return engine_cls.from_vllm_config(
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
    [rank0]:     return cls(
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
    [rank0]:     self._initialize_kv_caches()
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
    [rank0]:     self.model_executor.determine_num_available_blocks())
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
    [rank0]:     results = self.collective_rpc("determine_num_available_blocks")
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    [rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/utils.py", line 2605, in run_method
    [rank0]:     return func(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    [rank0]:     return func(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in determine_num_available_blocks
    [rank0]:     self.model_runner.profile_run()
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    [rank0]:     return func(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1299, in profile_run
    [rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1425, in _dummy_run
    [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    [rank0]:     return func(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1843, in execute_model
    [rank0]:     hidden_or_intermediate_states = model_executable(
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    [rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    [rank0]:     return forward_call(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 481, in forward
    [rank0]:     hidden_states = self.model(input_ids, positions, intermediate_tensors,
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
    [rank0]:     return self.forward(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 358, in forward
    [rank0]:     hidden_states, residual = layer(
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    [rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    [rank0]:     return forward_call(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
    [rank0]:     hidden_states = self.self_attn(
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    [rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    [rank0]:     return forward_call(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 184, in forward
    [rank0]:     qkv, _ = self.qkv_proj(hidden_states)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    [rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    [rank0]:     return forward_call(*args, **kwargs)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 486, in forward
    [rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
    [rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 202, in apply
    [rank0]:     return dispatch_unquantized_gemm()(x, layer.weight, bias)
    [rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
    [rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    [rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    [rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
    [rank0]:[W328 11:24:24.967138206 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
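A mismatch like this can be caught before loading any model by comparing the GPU's compute capability with the architectures a build was compiled for. The sketch below is only an illustration under stated assumptions: `parse_arch` and `has_kernel_for` are hypothetical helper names, not part of vLLM or PyTorch, and the matching rule is a simplification (a plain `sm_XY` binary is assumed to run on the same major version with an equal or higher minor version, a `compute_XY` PTX entry on any equal or newer capability, and suffixed entries such as `sm_90a` are conservatively treated as exact matches only). `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are real PyTorch calls, but they describe the installed PyTorch build, not the kernels inside the vLLM wheel, so a passing check here does not prove that vLLM's own kernels cover the GPU.

```python
import re


def parse_arch(arch):
    """Split 'sm_90', 'sm_120', 'compute_90' or 'sm_90a' into its parts.

    Returns (kind, major, minor, suffix), or None if the string is not
    in the expected form. The last digit is taken as the minor version.
    """
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)([a-z]?)", arch)
    if match is None:
        return None
    kind, major, minor, suffix = match.groups()
    return kind, int(major), int(minor), suffix


def has_kernel_for(capability, arch_list):
    """Return True if some entry in arch_list can run on `capability`.

    capability: (major, minor) tuple, e.g. (12, 0) for an RTX 5090.
    arch_list:  strings such as ['sm_80', 'sm_90', 'compute_90'].
    Simplified rule set, see the assumptions in the text above.
    """
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, a_major, a_minor, suffix = parsed
        if suffix:
            # Suffixed targets: accept only an exact match (conservative).
            if (a_major, a_minor) == (major, minor):
                return True
            continue
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # binary kernel, same major version
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX that the driver can JIT-compile
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with CUDA is not available; nothing to check.")
    else:
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("GPU compute capability:", capability)
        print("PyTorch compiled for:", arch_list)
        print("Covered by this PyTorch build:",
              has_kernel_for(capability, arch_list))
```

With the numbers from this post, `has_kernel_for((12, 0), ["sm_70", "sm_75", "sm_80", "sm_86", "sm_90"])` returns `False`, while an RTX 4080 at `(8, 9)` is covered by the same list through `sm_80` and `sm_86`. The last frame of the traceback is `dispatch_unquantized_gemm`, so checking the PyTorch build as well as the vLLM version may be worthwhile.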