LLM上下文工程：从Prompt设计到记忆系统的架构演进-尧图企业网站定制

引言Context Engineering 的崛起2026年“上下文工程”Context Engineering已取代提示工程Prompt Engineering成为 LLM 应用开发的核心范式。这一转变反映了行业的深刻认知LLM 的能力上限不在于模型参数而在于输入上下文的质量和结构。提示工程关注如何问对问题而上下文工程关注如何构建一个完整的认知环境让 LLM 高效工作。这包括动态上下文组装、长程记忆管理、工具结果编排、多轮对话状态压缩等多个维度。一个设计良好的上下文工程系统可以让一个 7B 模型的表现超越未优化上下文的 70B 模型。## 从 Prompt Engineering 到 Context Engineering### 范式转变的驱动因素| 维度 | Prompt Engineering | Context Engineering ||------|-------------------|-------------------|| 关注点 | 单次提示的措辞 | 完整的上下文环境构建 || 上下文管理 | 手动拼接 | 自动化动态组装 || 记忆机制 | 无每次从零开始 | 长期短期工作记忆 || 工具集成 | 硬编码在提示中 | 动态选择和编排 || 可维护性 | 修改提示词风险高 | 模块化、可测试 || 适用规模 | 单轮简单任务 | 多轮复杂 Agent 任务 |### 上下文窗口的注意力预算LLM 的上下文窗口虽然已扩展到百万 token但并非越长越好。研究表明注意力在长上下文中呈U型分布——开头和结尾的信息被更好地利用中间部分容易被忽略。pythonclass ContextBudget: 上下文注意力预算管理器 def __init__(self, max_tokens128000): self.max_tokens max_tokens # 预算分配策略 self.budget { system_prompt: 2000, # 系统提示 user_message: 4000, # 用户消息 retrieved_context: 40000, # 检索上下文 memory: 20000, # 记忆 tool_results: 30000, # 工具结果 few_shot: 10000, # 示例 scratchpad: 22000, # 推理草稿 } def allocate(self, actual_usage: dict) - dict: 根据实际使用动态调整预算 total sum(actual_usage.values()) if total self.max_tokens: return actual_usage # 按优先级压缩 priority [system_prompt, user_message, tool_results, retrieved_context, few_shot, memory, scratchpad] adjusted actual_usage.copy() for key in reversed(priority): if total self.max_tokens: break overflow total - self.max_tokens can_reduce max(0, adjusted[key] - adjusted[key] * 0.3) reduction min(overflow, can_reduce) adjusted[key] - reduction total - reduction return adjustedtext## 上下文工程的四大支柱### 支柱一动态上下文组装传统做法是将所有信息堆砌在提示中。上下文工程的做法是根据任务类型、用户意图和对话历史动态选择和组织上下文。pythonfrom dataclasses import dataclass, fieldfrom typing import List, Dict, Optionaldataclassclass ContextBlock: 上下文块最小可管理单元 content: str source: str # 来源system/user/memory/tool/retrieval priority: int # 优先级1最高 token_count: int compressible: bool True metadata: Dict field(default_factorydict)class DynamicContextAssembler: 动态上下文组装器 def __init__(self, tokenizer, max_tokens128000): self.tokenizer tokenizer self.max_tokens max_tokens def assemble(self, blocks: List[ContextBlock], query: str) - str: 根据查询动态组装最优上下文 # 1. 意图分析 intent self._analyze_intent(query) # 2. 相关性排序 scored_blocks [] for block in blocks: relevance self._score_relevance(block, query, intent) scored_blocks.append((block, relevance)) scored_blocks.sort(keylambda x: (-x[1], x[0].priority)) # 3. 预算分配 system_blocks [b for b, _ in scored_blocks if b.source system] user_blocks [b for b, _ in scored_blocks if b.source user] other_blocks [b for b, _ in scored_blocks if b.source not in (system, user)] # 4. 贪心填充 result [] used_tokens 0 # 系统提示优先 for block in system_blocks: if used_tokens block.token_count self.max_tokens: result.append(block) used_tokens block.token_count # 用户消息次之 for block in user_blocks: if used_tokens block.token_count self.max_tokens: result.append(block) used_tokens block.token_count # 其他按相关性填充 for block, score in scored_blocks: if block.source in (system, user): continue if used_tokens block.token_count self.max_tokens: result.append(block) used_tokens block.token_count elif block.compressible: # 压缩后尝试放入 compressed self._compress(block, self.max_tokens - used_tokens) if compressed: result.append(compressed) used_tokens compressed.token_count # 5. 结构化输出 return self._format_context(result)text### 支柱二多级记忆系统人类大脑有感觉记忆、短期记忆、长期记忆三层结构。LLM Agent 同样需要多级记忆系统来维持跨会话的连续性。pythonclass MemorySystem: 三级记忆架构 def __init__(self, llm, vector_store, max_working4000): self.llm llm self.vector_store vector_store self.max_working max_working # 工作记忆当前对话上下文 self.working_memory: List[Dict] [] # 情景记忆具体事件记录 self.episodic_store vector_store.collection(episodic) # 语义记忆抽象知识 self.semantic_store vector_store.collection(semantic) def add(self, event: Dict): 添加新记忆 # 1. 加入工作记忆 self.working_memory.append(event) # 2. 工作记忆溢出时摘要并转移到长期记忆 working_tokens self._count_tokens(self.working_memory) if working_tokens self.max_working: self._consolidate() def _consolidate(self): 记忆巩固工作记忆 → 长期记忆 # 将最旧的工作记忆摘要 old_memories self.working_memory[:len(self.working_memory)//2] recent_memories self.working_memory[len(self.working_memory)//2:] # 生成摘要 summary self.llm.generate( f将以下对话历史总结为关键信息\n f{self._format(old_memories)}\n f输出格式[事实] / [决策] / [偏好] / [待办] ) # 分类存储 for line in summary.split(\n): if line.startswith([事实]): self.semantic_store.add({content: line, type: fact}) elif line.startswith([决策]): self.episodic_store.add({content: line, type: decision}) elif line.startswith([偏好]): self.semantic_store.add({content: line, type: preference}) elif line.startswith([待办]): self.episodic_store.add({content: line, type: todo}) # 保留最近的工作记忆摘要 self.working_memory [ {role: system, content: f[之前的对话摘要]\n{summary}}, *recent_memories ] def recall(self, query: str, top_k: int 5) - List[Dict]: 回忆与查询相关的记忆 results [] # 语义记忆抽象知识 semantic_hits self.semantic_store.search(query, top_ktop_k) results.extend([{type: semantic, **h} for h in semantic_hits]) # 情景记忆具体事件 episodic_hits self.episodic_store.search(query, top_ktop_k) results.extend([{type: episodic, **h} for h in episodic_hits]) # 工作记忆最近上下文 results.extend([{type: working, content: m} for m in self.working_memory[-5:]]) return resultstext记忆系统性能对比| 记忆策略 | 上下文Token | 响应一致性 | 幻觉率 | 延迟 ||---------|------------|-----------|--------|------|| 无记忆全量历史 | 80K | 95% | 12% | 3.2s || 简单截断最近N轮 | 4K | 60% | 8% | 0.8s || 摘要压缩 | 8K | 78% | 10% | 1.5s || 三级记忆系统 | 12K | 89% | 6% | 1.8s |### 支柱三工具结果编排Agent 调用工具后返回的结果往往包含大量冗余信息。上下文工程需要对这些结果进行智能编排。pythonclass ToolResultCurator: 工具结果策展器 def curate(self, tool_name: str, raw_result: dict, query: str, token_budget: int) - str: 策展工具结果 strategies { web_search: self._curate_search, code_execution: self._curate_code, database_query: self._curate_db, file_read: self._curate_file, } curator strategies.get(tool_name, self._curate_generic) return curator(raw_result, query, token_budget) def _curate_search(self, result: dict, query: str, budget: int) - str: 策展搜索结果 formatted [] for item in result.get(results, []): # 提取最相关的段落 snippet self._extract_relevant_passage( item[content], query, max_tokens200 ) formatted.append( f**{item[title]}**\n来源: {item[url]}\n{snippet} ) return \n\n---\n\n.join(formatted)[:budget*4] # 粗略token估算 def _curate_code(self, result: dict, query: str, budget: int) - str: 策展代码执行结果 # 只保留 stdout 的最后 N 行错误信息 stdout result.get(stdout, ) stderr result.get(stderr, ) lines stdout.strip().split(\n) if len(lines) 20: stdout \n.join(lines[:5] [...省略N行...] lines[-15:]) parts [] if stdout: parts.append(f输出:\n\n{stdout}\n) if stderr: parts.append(f错误:\n\n{stderr}\n) return \n\n.join(parts)text### 支柱四上下文压缩与摘要当上下文接近窗口限制时需要智能压缩而非简单截断。pythonclass ContextCompressor: 上下文压缩器 def __init__(self, llm): self.llm llm def compress(self, messages: List[Dict], target_tokens: int) - List[Dict]: 将消息列表压缩到目标token数 current_tokens self._count_tokens(messages) if current_tokens target_tokens: return messages # 策略1移除低价值消息 messages self._remove_low_value(messages, target_tokens) # 策略2合并连续的工具调用 messages self._merge_tool_calls(messages) # 策略3摘要旧对话 if self._count_tokens(messages) target_tokens: messages self._summarize_old(messages, target_tokens) return messages def _remove_low_value(self, messages: List[Dict], target: int) - List[Dict]: 移除低价值消息 scored [] for i, msg in enumerate(messages): score self._value_score(msg, i, len(messages)) scored.append((i, msg, score)) # 按价值排序保留高价值的 scored.sort(keylambda x: -x[2]) kept [] kept_tokens 0 for i, msg, score in scored: msg_tokens self._count_tokens([msg]) if kept_tokens msg_tokens target: kept.append((i, msg)) kept_tokens msg_tokens # 按原始顺序排列 kept.sort(keylambda x: x[0]) return [msg for _, msg in kept] def _value_score(self, msg: Dict, index: int, total: int) - float: 计算消息的价值分数 score 0.0 # 系统消息价值最高 if msg[role] system: score 10.0 # 最近的消息价值更高 recency (index 1) / total score recency * 5.0 # 包含决策/结论的消息价值更高 content msg.get(content, ) if any(kw in content for kw in [决定, 结论, 因此, 所以]): score 3.0 # 包含代码的消息价值中等 if in content: score 2.0 # 纯确认/寒暄价值低 if len(content) 20: score - 2.0 return scoretext## 生产级架构设计### 上下文工程的完整架构pythonclass ContextEngineeringSystem: “”“上下文工程系统集成所有支柱”“” definit(self, llm, config): self.llm llm self.memory MemorySystem(llm, config.vector_store) self.assembler DynamicContextAssembler(config.tokenizer) self.curator ToolResultCurator() self.compressor ContextCompressor(llm) self.budget ContextBudget(config.max_context_tokens) def build_context(self, user_query: str, conversation_history: List[Dict]) - str: “”“构建完整的推理上下文”“” # 1. 记忆回忆 relevant_memories self.memory.recall(user_query) # 2. 工具结果策展如果有 tool_results self._get_pending_tool_results() curated_results {} for tool, result in tool_results.items(): curated self.curator.curate( tool, result, user_query, self.budget.budget[“tool_results”] // len(tool_results) ) curated_results[tool] curated # 3. 组装上下文块 blocks [] blocks.append(ContextBlock( contentself._system_prompt(), source“system”, priority1, token_countcount_tokens(self._system_prompt()) )) for mem in relevant_memories: blocks.append(ContextBlock( contentf[{mem[‘type’]}] {mem[‘content’]}“, source“memory”, priority2, token_countcount_tokens(mem[“content”]) )) for tool, result in curated_results.items(): blocks.append(ContextBlock( contentf”[工具:{tool}]\n{result}, source“tool”, priority3, token_countcount_tokens(result) )) blocks.append(ContextBlock( contentuser_query, source“user”, priority1, token_countcount_tokens(user_query) )) # 4. 动态组装 context self.assembler.assemble(blocks, user_query) # 5. 如果超限压缩 if count_tokens(context) self.budget.max_tokens: context self.compressor.compress( parse_to_messages(context), self.budget.max_tokens ) return contexttext## 效果评估在内部 Agent 评测集上上下文工程系统的效果| 指标 | 基线简单拼接 | 上下文工程 | 提升 ||------|---------------|-----------|------|| 任务成功率 | 62% | 84% | 35% || 上下文利用率 | 45% | 78% | 73% || 平均Token消耗 | 95K | 32K | -66% || 幻觉率 | 14% | 5% | -64% || 端到端延迟 | 4.2s | 1.8s | -57% |## 结语上下文工程是 LLM 应用从能跑到好用的关键一跃。它不是单一的技巧或模板而是一套系统化的方法论——动态组装确保相关性多级记忆保证连续性工具策展提升信息密度智能压缩控制成本。对于工程团队而言建议从最简单的动态组装开始逐步引入记忆系统和压缩机制用 A/B 测试验证每一步优化的实际收益。

相关新闻

多个组的融合

宝宝照片视频高效同步给父母！实测5款亲子共享工具最优选型

一、架构的定义

嵌入式设备安全连接实践：PIC18F25K42与A5000芯片TLS通信

西安无人茶室小程序开发哪家靠谱，自助开门物联网对接方案

Agent工程-Agent Skills到底是什么-为什么现在大家都在做技能包

#单词卡片生成器 - 拍照识物学单词

Agent知识层-codebase-memory-mcp和代码知识图谱到底值不值得做

本地生活 GEO 优化实战：实体锚点、信源矩阵与场景化知识库的构建逻辑

塞尔达传说旷野之息存档修改器：3分钟掌握海拉鲁世界自由定制技巧

基于Si4731和STM32的智能收音系统开发指南

错过这6个SonarLint高级技巧，你在IDEA里写的每行代码都可能成为生产事故源头——资深架构师20年代码治理血泪总结

管理者的六个层次

审计来了，数据权限全开——审计走了，怎么确保权限全部关掉？

38.工业通用 PLC 分拣模板！传感器去抖 + 气缸互锁 + 状态机 + 超时报警全套

基于Dify与DeepSeek构建私有知识库问答系统实战指南

YOLOv8推理性能优化：从1.2FPS到35FPS的全链路加速实践

NVIDIA显示器色彩校准终极指南：5分钟实现专业级sRGB色彩还原