从Transformers到vLLMMiniCPM-V-4.6-AWQ全框架部署指南【免费下载链接】MiniCPM-V-4.6-AWQ项目地址: https://ai.gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQMiniCPM-V-4.6-AWQ是OpenBMB开源社区推出的轻量级多模态模型基于AWQ量化技术实现高效部署。本文将详细介绍如何通过Transformers和vLLM框架快速部署这一模型让你在消费级GPU上也能体验强大的图像与视频理解能力。模型简介为什么选择MiniCPM-V-4.6-AWQMiniCPM-V-4.6-AWQ作为MiniCPM-V 4.6的AWQ量化版本继承了原模型的三大核心优势超高效架构基于LLaVA-UHD v4技术视觉编码计算量减少50%以上相比Qwen3.5-0.8B实现约1.5倍的token吞吐量多模态能力在OpenCompass、RefCOCO等多个基准测试中达到Qwen3.5 2B级别性能支持单图、多图和视频理解广泛部署支持适配vLLM、SGLang、llama.cpp等主流推理框架提供GGUF、BNB、AWQ、GPTQ等多种量化格式该模型特别适合边缘设备部署已成功适配iOS、Android和HarmonyOS三大移动平台所有边缘适配代码均已开源。环境准备部署前的必要配置在开始部署前请确保你的环境满足以下要求Python 3.8PyTorch 2.0CUDA 11.7推荐12.1以上以获得最佳性能至少4GB显存量化版本首先克隆项目仓库git clone https://gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQ cd MiniCPM-V-4.6-AWQ方法一使用Transformers框架部署Transformers是Hugging Face推出的开源库提供了简单易用的API来加载和运行预训练模型。安装依赖pip install transformers[torch]5.7.0 torchvision torchcodecCUDA兼容性提示torchcodec可能与某些CUDA版本存在兼容性问题。如果遇到RuntimeError: Could not load libtorchcodec错误可以使用PyAV替代pip install transformers[torch]5.7.0 torchvision av指定CUDA版本安装pip install transformers5.7.0 torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128加载模型from transformers import AutoModelForImageTextToText, AutoProcessor model_id openbmb/MiniCPM-V-4.6-AWQ processor AutoProcessor.from_pretrained(model_id) model AutoModelForImageTextToText.from_pretrained( model_id, torch_dtypeauto, device_mapauto ) # 推荐使用Flash Attention 2加速需要安装flash-attn # model AutoModelForImageTextToText.from_pretrained( # model_id, # torch_dtypetorch.bfloat16, # attn_implementationflash_attention_2, # device_mapauto, # )图像推理示例messages [ { role: user, content: [ {type: image, url: https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png}, {type: text, text: What causes this phenomenon?}, ], } ] downsample_mode 16x # 使用4x可获得更精细的细节 inputs processor.apply_chat_template( messages, tokenizeTrue, add_generation_promptTrue, return_dictTrue, return_tensorspt, downsample_modedownsample_mode, max_slice_nums36, ).to(model.device) generated_ids model.generate(**inputs, downsample_modedownsample_mode, max_new_tokens512) generated_ids_trimmed [ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text processor.batch_decode( generated_ids_trimmed, skip_special_tokensTrue, clean_up_tokenization_spacesFalse ) print(output_text[0])启动Transformers服务Transformers提供了轻量级的OpenAI兼容服务器适合快速测试和中等负载部署pip install transformers[serving]5.7.0 transformers serve openbmb/MiniCPM-V-4.6-AWQ --port 8000 --host 0.0.0.0 --continuous-batching发送请求示例curl -s http://localhost:8000/v1/chat/completions \ -H Content-Type: application/json \ -d { model: openbmb/MiniCPM-V-4.6-AWQ, messages: [{ role: user, content: [ {type: image_url, image_url: {url: https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png}}, {type: text, text: What causes this phenomenon?} ] }] }方法二使用vLLM框架部署vLLM是一个高性能的LLM服务库支持PagedAttention技术可显著提高吞吐量并降低延迟。安装vLLMpip install vllm启动vLLM服务vllm serve openbmb/MiniCPM-V-4.6-AWQ \ --port 8000 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --default-chat-template-kwargs {enable_thinking: false}提示如果不需要工具调用功能可以简化命令为vllm serve openbmb/MiniCPM-V-4.6-AWQ --port 8000发送推理请求curl -s http://localhost:8000/v1/chat/completions -H Content-Type: application/json -d { model: openbmb/MiniCPM-V-4.6-AWQ, messages: [{role: user, content: [ {type: image_url, image_url: {url: https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png}}, {type: text, text: What causes this phenomenon?} ]}] }工具调用示例vLLM支持自动工具调用功能示例如下curl -s http://localhost:8000/v1/chat/completions -H Content-Type: application/json -d { model: openbmb/MiniCPM-V-4.6-AWQ, messages: [{role: user, content: [ {type: text, text: 北京的天气} ]}], tools: [{ type: function, function: { name: get_weather, description: Get the current weather for a given location, parameters: { type: object, properties: { location: {type: string, description: City name} }, required: [location] } } }] }高级参数配置无论是使用Transformers还是vLLM都可以通过调整参数来平衡性能和效果参数默认值适用对象描述downsample_mode16x图像 视频视觉token下采样模式。16x为效率优先4x保留更多细节需同时传递给generate()max_slice_nums9图像 视频高分辨率图像分割的最大切片数。图像推荐36视频推荐1max_num_frames128视频视频最大帧数。短视频默认1 FPS长视频自动均匀采样stack_frames1视频每秒采样点数。短视频推荐1长视频推荐3或5其他部署选项除了Transformers和vLLMMiniCPM-V-4.6-AWQ还支持多种部署框架SGLang部署pip install sglang python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6-AWQ --port 30000llama.cpp部署# 首先获取GGUF格式模型 llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080Ollama部署ollama run minicpm-v-4.6在交互会话中直接粘贴图像路径或URL即可与模型对话。总结与展望MiniCPM-V-4.6-AWQ凭借其高效的架构设计和广泛的框架支持成为边缘设备和消费级GPU上部署多模态模型的理想选择。无论是使用Transformers进行快速集成还是通过vLLM获得更高吞吐量都能轻松实现模型的本地化部署。随着移动平台部署代码的开源开发者可以进一步探索在iOS、Android和HarmonyOS设备上的部署方案。对于需要定制化的场景还可以利用LLaMA-Factory或ms-swift等工具进行微调快速适配新的领域和任务。通过本文介绍的方法你可以在短短几分钟内完成MiniCPM-V-4.6-AWQ的部署开启高效多模态AI应用的开发之旅【免费下载链接】MiniCPM-V-4.6-AWQ项目地址: https://ai.gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQ创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考
从Transformers到vLLM:MiniCPM-V-4.6-AWQ全框架部署指南
从Transformers到vLLMMiniCPM-V-4.6-AWQ全框架部署指南【免费下载链接】MiniCPM-V-4.6-AWQ项目地址: https://ai.gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQMiniCPM-V-4.6-AWQ是OpenBMB开源社区推出的轻量级多模态模型基于AWQ量化技术实现高效部署。本文将详细介绍如何通过Transformers和vLLM框架快速部署这一模型让你在消费级GPU上也能体验强大的图像与视频理解能力。模型简介为什么选择MiniCPM-V-4.6-AWQMiniCPM-V-4.6-AWQ作为MiniCPM-V 4.6的AWQ量化版本继承了原模型的三大核心优势超高效架构基于LLaVA-UHD v4技术视觉编码计算量减少50%以上相比Qwen3.5-0.8B实现约1.5倍的token吞吐量多模态能力在OpenCompass、RefCOCO等多个基准测试中达到Qwen3.5 2B级别性能支持单图、多图和视频理解广泛部署支持适配vLLM、SGLang、llama.cpp等主流推理框架提供GGUF、BNB、AWQ、GPTQ等多种量化格式该模型特别适合边缘设备部署已成功适配iOS、Android和HarmonyOS三大移动平台所有边缘适配代码均已开源。环境准备部署前的必要配置在开始部署前请确保你的环境满足以下要求Python 3.8PyTorch 2.0CUDA 11.7推荐12.1以上以获得最佳性能至少4GB显存量化版本首先克隆项目仓库git clone https://gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQ cd MiniCPM-V-4.6-AWQ方法一使用Transformers框架部署Transformers是Hugging Face推出的开源库提供了简单易用的API来加载和运行预训练模型。安装依赖pip install transformers[torch]5.7.0 torchvision torchcodecCUDA兼容性提示torchcodec可能与某些CUDA版本存在兼容性问题。如果遇到RuntimeError: Could not load libtorchcodec错误可以使用PyAV替代pip install transformers[torch]5.7.0 torchvision av指定CUDA版本安装pip install transformers5.7.0 torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128加载模型from transformers import AutoModelForImageTextToText, AutoProcessor model_id openbmb/MiniCPM-V-4.6-AWQ processor AutoProcessor.from_pretrained(model_id) model AutoModelForImageTextToText.from_pretrained( model_id, torch_dtypeauto, device_mapauto ) # 推荐使用Flash Attention 2加速需要安装flash-attn # model AutoModelForImageTextToText.from_pretrained( # model_id, # torch_dtypetorch.bfloat16, # attn_implementationflash_attention_2, # device_mapauto, # )图像推理示例messages [ { role: user, content: [ {type: image, url: https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png}, {type: text, text: What causes this phenomenon?}, ], } ] downsample_mode 16x # 使用4x可获得更精细的细节 inputs processor.apply_chat_template( messages, tokenizeTrue, add_generation_promptTrue, return_dictTrue, return_tensorspt, downsample_modedownsample_mode, max_slice_nums36, ).to(model.device) generated_ids model.generate(**inputs, downsample_modedownsample_mode, max_new_tokens512) generated_ids_trimmed [ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text processor.batch_decode( generated_ids_trimmed, skip_special_tokensTrue, clean_up_tokenization_spacesFalse ) print(output_text[0])启动Transformers服务Transformers提供了轻量级的OpenAI兼容服务器适合快速测试和中等负载部署pip install transformers[serving]5.7.0 transformers serve openbmb/MiniCPM-V-4.6-AWQ --port 8000 --host 0.0.0.0 --continuous-batching发送请求示例curl -s http://localhost:8000/v1/chat/completions \ -H Content-Type: application/json \ -d { model: openbmb/MiniCPM-V-4.6-AWQ, messages: [{ role: user, content: [ {type: image_url, image_url: {url: https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png}}, {type: text, text: What causes this phenomenon?} ] }] }方法二使用vLLM框架部署vLLM是一个高性能的LLM服务库支持PagedAttention技术可显著提高吞吐量并降低延迟。安装vLLMpip install vllm启动vLLM服务vllm serve openbmb/MiniCPM-V-4.6-AWQ \ --port 8000 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --default-chat-template-kwargs {enable_thinking: false}提示如果不需要工具调用功能可以简化命令为vllm serve openbmb/MiniCPM-V-4.6-AWQ --port 8000发送推理请求curl -s http://localhost:8000/v1/chat/completions -H Content-Type: application/json -d { model: openbmb/MiniCPM-V-4.6-AWQ, messages: [{role: user, content: [ {type: image_url, image_url: {url: https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png}}, {type: text, text: What causes this phenomenon?} ]}] }工具调用示例vLLM支持自动工具调用功能示例如下curl -s http://localhost:8000/v1/chat/completions -H Content-Type: application/json -d { model: openbmb/MiniCPM-V-4.6-AWQ, messages: [{role: user, content: [ {type: text, text: 北京的天气} ]}], tools: [{ type: function, function: { name: get_weather, description: Get the current weather for a given location, parameters: { type: object, properties: { location: {type: string, description: City name} }, required: [location] } } }] }高级参数配置无论是使用Transformers还是vLLM都可以通过调整参数来平衡性能和效果参数默认值适用对象描述downsample_mode16x图像 视频视觉token下采样模式。16x为效率优先4x保留更多细节需同时传递给generate()max_slice_nums9图像 视频高分辨率图像分割的最大切片数。图像推荐36视频推荐1max_num_frames128视频视频最大帧数。短视频默认1 FPS长视频自动均匀采样stack_frames1视频每秒采样点数。短视频推荐1长视频推荐3或5其他部署选项除了Transformers和vLLMMiniCPM-V-4.6-AWQ还支持多种部署框架SGLang部署pip install sglang python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6-AWQ --port 30000llama.cpp部署# 首先获取GGUF格式模型 llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080Ollama部署ollama run minicpm-v-4.6在交互会话中直接粘贴图像路径或URL即可与模型对话。总结与展望MiniCPM-V-4.6-AWQ凭借其高效的架构设计和广泛的框架支持成为边缘设备和消费级GPU上部署多模态模型的理想选择。无论是使用Transformers进行快速集成还是通过vLLM获得更高吞吐量都能轻松实现模型的本地化部署。随着移动平台部署代码的开源开发者可以进一步探索在iOS、Android和HarmonyOS设备上的部署方案。对于需要定制化的场景还可以利用LLaMA-Factory或ms-swift等工具进行微调快速适配新的领域和任务。通过本文介绍的方法你可以在短短几分钟内完成MiniCPM-V-4.6-AWQ的部署开启高效多模态AI应用的开发之旅【免费下载链接】MiniCPM-V-4.6-AWQ项目地址: https://ai.gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQ创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考