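Chapter 10: Automated Operations

10.1 Why Automated Operations Are Needed

Operating Elasticsearch (ES) clusters at scale by hand runs into the following problems.

Pain points of manual operations:

- Low efficiency: with 100 clusters, configuring each one by hand takes an enormous amount of time.
- Inconsistent configuration: manual configuration is error-prone, so clusters drift apart.
- Slow response: manual intervention during an incident is slow and puts the SLA at risk.
- No audit trail: manual operations are hard to trace and cannot be rolled back.

Value of automated operations:

- Efficiency: automating operations improves efficiency by 10x or more.
- Consistency: configuration as code keeps configurations consistent.
- Fast response: automated fault handling enables fast recovery.
- Traceability: every operation is under version control and can be rolled back.

10.2 Infrastructure as Code (Terraform)

Terraform overview

Terraform is an infrastructure-as-code tool for managing cloud resources.

Advantages:

- Declarative configuration: you describe the desired state and Terraform works out how to reach it.
- Version control: configuration files are managed in Git, so changes are traceable and can be rolled back.
- Multi-cloud support: works with AWS, GCP, Azure, and other cloud providers.

Terraform configuration example

File: main.tf

    # Configure the AWS provider
    provider "aws" {
      region = "us-east-1"
    }

    # Look up the availability zones referenced by the subnets below
    data "aws_availability_zones" "available" {
      state = "available"
    }

    # Create the VPC
    resource "aws_vpc" "es_vpc" {
      cidr_block           = "10.0.0.0/16"
      enable_dns_hostnames = true
      enable_dns_support   = true

      tags = {
        Name = "elasticsearch-vpc"
      }
    }

    # Create the subnets, one per availability zone
    resource "aws_subnet" "es_subnet" {
      count                   = 3
      vpc_id                  = aws_vpc.es_vpc.id
      cidr_block              = "10.0.${count.index}.0/24"
      availability_zone       = data.aws_availability_zones.available.names[count.index]
      map_public_ip_on_launch = true

      tags = {
        Name = "elasticsearch-subnet-${count.index}"
      }
    }

    # Create the security group
    resource "aws_security_group" "es_sg" {
      name        = "elasticsearch-sg"
      description = "Elasticsearch security group"
      vpc_id      = aws_vpc.es_vpc.id

      # HTTP port
      ingress {
        from_port   = 9200
        to_port     = 9200
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/16"]
      }

      # Transport port
      ingress {
        from_port   = 9300
        to_port     = 9300
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/16"]
      }

      # SSH port (open to the internet here; restrict the CIDR in production)
      ingress {
        from_port   = 22
        to_port     = 22
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }

      # Allow all outbound traffic
      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }

    # Create the master nodes
    resource "aws_instance" "master" {
      count                  = 3
      ami                    = "ami-0c55b159cb8fe0f00"
      instance_type          = "m5.large"
      subnet_id              = aws_subnet.es_subnet[count.index % 3].id
      vpc_security_group_ids = [aws_security_group.es_sg.id]
      key_name               = "my-key-pair"

      user_data = <<-EOF
        #!/bin/bash
        # Install ES
        wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.15.2-linux-x86_64.tar.gz
        tar -xzf elasticsearch-8.15.2-linux-x86_64.tar.gz
      EOF

      tags = {
        Name = "elasticsearch-master-${count.index}"
        Role = "master"
      }
    }

    # Create the data nodes
    resource "aws_instance" "data" {
      count                  = 6
      ami                    = "ami-0c55b159cb8fe0f00"
      instance_type          = "r5.2xlarge"
      subnet_id              = aws_subnet.es_subnet[count.index % 3].id
      vpc_security_group_ids = [aws_security_group.es_sg.id]
      key_name               = "my-key-pair"

      # Root volume (GiB)
      root_block_device {
        volume_size = 100
        volume_type = "gp3"
      }

      # Data volume (GiB)
      ebs_block_device {
        device_name = "/dev/sdb"
        volume_size = 4000
        volume_type = "gp3"
      }

      tags = {
        Name = "elasticsearch-data-${count.index}"
        Role = "data"
      }
    }

Terraform operations

    # Initialize the working directory
    terraform init

    # Plan (preview changes)
    terraform plan

    # Apply (create resources)
    terraform apply

    # Destroy (delete resources)
    terraform destroy

10.3 Configuration Management (Ansible)

Ansible overview

Ansible is a configuration management tool for configuring servers in bulk.

Advantages:

- Agentless: it connects over SSH, so no agent has to be installed.
- Idempotent: running the same playbook repeatedly produces the same result.
- Modular: a rich set of modules makes it easy to extend.

Ansible configuration example

File: inventory

    [master]
    master1 ansible_host=10.0.1.10
    master2 ansible_host=10.0.1.11
    master3 ansible_host=10.0.1.12

    [data_hot]
    data1 ansible_host=10.0.2.10
    data2 ansible_host=10.0.2.11
    data3 ansible_host=10.0.2.12

    [data_cold]
    data4 ansible_host=10.0.3.10
    data5 ansible_host=10.0.3.11
    data6 ansible_host=10.0.3.12

    [coordinating]
    coord1 ansible_host=10.0.4.10
    coord2 ansible_host=10.0.4.11

    [all:vars]
    ansible_user=centos
    ansible_ssh_private_key_file=~/.ssh/my-key.pem

File: playbook.yml

    ---
    - name: Install Elasticsearch
      hosts: all
      become: yes
      vars:
        es_version: "8.15.2"
        es_cluster_name: "my-es-cluster"
      tasks:
        # Install Java (Elasticsearch 8.x also ships with a bundled JDK)
        - name: Install Java
          yum:
            name: java-11-openjdk
            state: present

        # Create the ES user
        - name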