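# Big Data Technology: Core Knowledge Review Outline

## 1. Basic Linux Commands

| Command | Function | Common options |
|---|---|---|
| `cd` | Change directory | `.` current directory, `..` parent directory |
| `ls` | List files | `-a` show hidden files, `-l` detailed listing, `-R` recursive |
| `cat` | View a file | `-n` show line numbers |
| `mkdir` | Create a directory | `-p` create parent directories |
| `rm` | Delete files/directories | `-f` force, `-r` recursive |
| `cp` | Copy | `-f` overwrite, `-R` recursive |
| `mv` | Move/rename | `-f` overwrite, `-i` prompt before overwriting |
| `pwd` | Print the current path | - |
| `chmod` | Change permissions | Numeric mode: r=4, w=2, x=1, e.g. `700` |
| `sudo` | Run a command as superuser | - |
| `source` | Apply a configuration file in the current shell | - |
| `hostnamectl` | Change the hostname | `set-hostname` |
| `ifconfig` | Show IP addresses | - |

## 2. JDK Installation and Environment Variables

Steps:

1. Upload `jdk-8u161-linux-x64.tar.gz` to `/opt`.
2. Extract it: `sudo tar -zxvf jdk-8u161-linux-x64.tar.gz`
3. Edit `/etc/profile` and add:

   ```bash
   export JAVA_HOME=/opt/jdk1.8.0_161
   export JRE_HOME=${JAVA_HOME}/jre
   export CLASSPATH=.:${JAVA_HOME}/lib
   export PATH=${JAVA_HOME}/bin:$PATH
   ```

4. Run `source /etc/profile` to apply the changes.
5. Verify: `java -version`

## 3. Passwordless SSH Login

Procedure:

1. Generate a key pair: `ssh-keygen -t rsa` (press Enter at every prompt).
2. Append the public key to the authorized keys file:

   ```bash
   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
   ```

3. Set permissions:

   ```bash
   chmod 755 ~
   chmod 700 ~/.ssh
   chmod 600 ~/.ssh/authorized_keys
   ```

4. Distribute the public key to the other nodes:

   ```bash
   ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
   ```

5. Verify: `ssh master` succeeds if it logs in without asking for a password.

## 4. Hadoop Deployment

### 4.1 Pseudo-Distributed Deployment (Single Node)

Configuration files to modify (located in `$HADOOP_HOME/etc/hadoop/`):

| File | Key settings |
|---|---|
| `hadoop-env.sh` | `export JAVA_HOME=/opt/jdk1.8.0_161` |
| `core-site.xml` | `fs.defaultFS` → `hdfs://master:9000`; `hadoop.tmp.dir` → temporary directory |
| `hdfs-site.xml` | `dfs.replication` → `1`; `dfs.namenode.name.dir` and `dfs.datanode.data.dir` |
| `yarn-site.xml` | `yarn.resourcemanager.hostname` → `master`; `yarn.nodemanager.aux-services` → `mapreduce_shuffle` |
| `mapred-site.xml` | `mapreduce.framework.name` → `yarn` |

Start and verify:

- Format: `hdfs namenode -format`
- Start: `start-all.sh` (or `start-dfs.sh` followed by `start-yarn.sh`)
- Process check: `jps` should show NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
- Web UI: HDFS → `http://master:50070`, YARN → `http://master:8088`. Note that 50070 is the Hadoop 2.x default for the NameNode UI; Hadoop 3.x uses 9870 unless `dfs.namenode.http-address` is set explicitly.

### 4.2 Fully Distributed Cluster Deployment

Node plan:

| Node | HDFS role | YARN role |
|---|---|---|
| master | NameNode | ResourceManager |
| slave1, slave2 | DataNode | NodeManager |

Key configuration changes:

- `/etc/hosts`: add the IP-to-hostname mapping for every node.
- `masters` file: write `master`.
- `workers` file: write `slave1` and `slave2` (remove `localhost`).
- `hdfs-site.xml`: set `dfs.replication` to `2`.
- Delete the contents of the temporary directory `datanode_1_dir`.

After cloning the virtual machine, modify each slave node:

- Change the IP address (`/etc/netplan/*.yaml`).
- Change the hostname: `sudo hostnamectl set-hostname slave1`
- Edit `hdfs-site.xml`: comment out the NameNode settings and keep the DataNode settings.

Start the cluster:

- Format on master: `hdfs namenode -format`
- Start: `start-all.sh`
- Run `jps` on each node to verify its processes.

## 5. Hive Installation and Configuration

### 5.1 MySQL as the Metastore Database (MySQL 8.0.14)

- Install the dependency: `sudo apt-get install libaio1`
- Install the deb packages in order: `mysql-common`, `libmysqlclient21`, ..., `mysql-community-server`

Note: if no root password was set during installation, log in with `sudo mysql` and run:

```sql
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '123456';
FLUSH PRIVILEGES;
```

Create a user and grant privileges:

```sql
CREATE USER 'zeng'@'%' IDENTIFIED BY '123456';
GRANT ALL ON *.* TO 'zeng'@'%';
CREATE DATABASE bigdata; -- stores the Hive metadata
```

### 5.2 Hive 3.1.1 Deployment

Configuration files:

- `hive-site.xml`: configure the MySQL connection (URL, driver, username, password).
- `hive-env.sh`: set `export HADOOP_HOME=/opt/hadoop-3.1.1`
- `/etc/profile`: add `HIVE_HOME` and update `PATH`.
- `hive-config.sh`: set `JAVA_HOME`, `HADOOP_HOME`, `HIVE_HOME`.

Initialize and start:

1. Copy the MySQL driver `mysql-connector-java-8.0.14.jar` into Hive's `lib` directory.
2. Initialize the metastore schema: `schematool -initSchema -dbType mysql`
3. Start Hive: `hive` (Hadoop must already be running).

## 6. Hive Database and Table Operations

### 6.1 Database DDL

```sql
CREATE DATABASE classtest;
CREATE DATABASE IF NOT EXISTS classtest;
SHOW DATABASES;
DESCRIBE DATABASE classtest; -- shows where the database is stored
DROP DATABASE IF EXISTS classtest;
```

### 6.2 Table Operations

There are three ways to create a table.

① Create the table directly:

```sql
-- The field delimiter is missing from these notes; a tab ('\t') is assumed.
CREATE TABLE student (
  id string COMMENT 'student id',
  name string COMMENT 'student name',
  age int COMMENT 'student age'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/classtest.db/student';
```

② Create from a query with AS (copies the selected columns and the matching rows):

```sql
-- The comparison operator is missing from these notes; '>' is assumed.
CREATE TABLE student2 AS SELECT id, name FROM student WHERE id > 150;
```

③ Create with LIKE (copies the structure only, no data):

```sql
CREATE TABLE student3 LIKE student;
```

Other common commands:

- `SHOW TABLES;`
- `DESC student;` (show the table structure)
- `ALTER TABLE old_name RENAME TO new_name;`

## 7. Hive Partitioned Tables (Key Topic)

### 7.1 Single-Column Static Partition

```sql
CREATE TABLE cityperson (
  id string,
  name string,
  age int
)
PARTITIONED BY (city string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```

Load data:

```sql
LOAD DATA LOCAL INPATH '/home/user1/citydataForxiamen.txt'
INTO TABLE cityperson PARTITION (city='xiamen');
```

- Show partitions: `SHOW PARTITIONS cityperson;`
- HDFS layout: `/user/hive/warehouse/<database>.db/<table>/city=<value>/`

### 7.2 Multi-Column Static Partition

```sql
-- The field delimiter is missing from these notes; a tab ('\t') is assumed.
CREATE TABLE agentinformation (
  agentID string,
  agentName string,
  agentAddress string
)
PARTITIONED BY (province string, city string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```

Example of loading data:

```sql
LOAD DATA LOCAL INPATH ... INTO TABLE agentinformation PARTITION (province='shanxi', city='xian');
```

To overwrite existing data, add the `OVERWRITE` keyword:

```sql
LOAD DATA LOCAL INPATH ... OVERWRITE INTO TABLE agentinformation PARTITION (province='fujian', city='xiamen');
```

## 8. Sqoop Data Migration

### 8.1 Sqoop Installation and Configuration

1. Extract `sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz` to `/opt`.
2. Configure `sqoop-env.sh`: set `HADOOP_COMMON_HOME`, `HADOOP_MAPRED_HOME`, `HIVE_HOME`.
3. Configure `/etc/profile`: `export SQOOP_HOME` and update `PATH`.
4. Copy the MySQL driver into Sqoop's `lib` directory.
5. Verify: `./bin/sqoop version`

### 8.2 Import Operations

① MySQL → HDFS

```bash
./bin/sqoop import \
  --connect "jdbc:mysql://master:3306/teacherinfo?serverTimezone=UTC" \
  --username zeng --password 123456 \
  --table teacher \
  --delete-target-dir -m 3
```

By default the data is imported to `/user/user1/teacher` on HDFS.

② MySQL → Hive

```bash
./bin/sqoop import \
  --connect "jdbc:mysql://master:3306/teacherinfo?serverTimezone=UTC" \
  --username zeng --password 123456 \
  --table teacher \
  --delete-target-dir \
  --hive-import --hive-table teacher -m 3
```

- Prerequisite: copy `hive-common-3.1.1.jar` into Sqoop's `lib` directory to avoid `ClassNotFoundException`.
- The table is created automatically in Hive. Its column order may differ from the MySQL table, so check it after the import.

③ Conditional import (`--where`)

```bash
./bin/sqoop import \
  --connect "jdbc:mysql://master:3306/erp?serverTimezone=UTC" \
  --username zeng --password 123456 \
  --table emp --columns ename,eaddress,esalary \
  --where "esalary > 4000" \
  --delete-target-dir --hive-import --hive-table esalary -m 3
```

The comparison operator in `--where` is missing from these notes; `>` is assumed here.

### 8.3 Common Parameters

| Parameter | Purpose |
|---|---|
| `--connect` | JDBC URL |
| `--table` | Source table name |
| `--columns` | Columns to import |
| `--where` | Filter condition |
| `-m` | Number of map tasks (parallelism) |
| `--delete-target-dir` | Delete the target directory first to avoid conflicts |
| `--hive-import` | Import into Hive |
| `--hive-table` | Hive table name |

## 9. Common Problems and Fixes

| Problem | Cause | Fix |
|---|---|---|
| `ClassNotFoundException: HiveConf` | Sqoop is missing the Hive jar | Copy `hive-common-*.jar` into Sqoop's `lib` |
| Cannot log in after installing MySQL | No root password was set, or the authentication plugin is incompatible | Log in with `sudo mysql` and reset the password with `ALTER USER ... mysql_native_password` |
| No SecondaryNameNode after starting Hadoop | `/etc/hosts` is misconfigured | Check the hostname mappings; make sure `localhost` and the real hostname are both correct |
| NameNode is in safe mode | Just started, or abnormal state | Leave safe mode with `hdfs dfsadmin -safemode leave` |
| `MRAppMaster` error when running an MR program | Missing environment variables | Add `yarn.app.mapreduce.am.env` and the related properties to `mapred-site.xml` |
| Permission denied | Wrong directory permissions | Use `chmod` to set 700 or 755 |

## 10. Review Suggestions

- Hands-on practice: follow the order of this document, Linux basics → JDK → SSH → Hadoop (pseudo-distributed → cluster) → Hive → Sqoop, and take notes as you go.
- Key points to master:
  - The meaning of the core Hadoop configuration files (`core-site`, `hdfs-site`, `yarn-site`, `mapred-site`)
  - Designing Hive partitioned tables and loading data into them (`PARTITIONED BY`, `LOAD DATA`)
  - Combining the parameters of the Sqoop import command
- Troubleshooting: know the common error messages (such as `ClassNotFoundException` and `SafeModeException`) and the steps to resolve them.
- Web monitoring: get familiar with the web UIs on ports 50070 (HDFS) and 8088 (YARN), which show node status and job progress.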
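
As a worked example of the numeric `chmod` notation (r=4, w=2, x=1) from the Linux commands table, the sketch below converts a nine-character symbolic permission string into its three-digit form. The function name `perm_to_octal` is a hypothetical helper made up for this illustration, not a standard tool.

```shell
#!/bin/sh
# Convert a 9-character symbolic permission string (e.g. rwx------) to octal.
# perm_to_octal is a hypothetical helper written for this illustration.
perm_to_octal() {
    perms=$1
    result=""
    # Characters 1-3 are the owner, 4-6 the group, 7-9 everyone else.
    for start in 1 4 7; do
        triple=$(printf '%s' "$perms" | cut -c"${start}-$((start + 2))")
        digit=0
        case $triple in r??) digit=$((digit + 4)) ;; esac   # read
        case $triple in ?w?) digit=$((digit + 2)) ;; esac   # write
        case $triple in ??x) digit=$((digit + 1)) ;; esac   # execute
        result="${result}${digit}"
    done
    printf '%s\n' "$result"
}

perm_to_octal rwx------   # prints 700
perm_to_octal rwxr-xr-x   # prints 755
perm_to_octal rw-------   # prints 600
```

These three values are the ones used in the SSH setup: 755 for the home directory, 700 for `~/.ssh`, and 600 for `authorized_keys`.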
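
The Hadoop configuration table lists property names and values but not the file layout. The sketch below writes a minimal `core-site.xml` into a scratch directory so the `<property>` structure can be inspected. The `hadoop.tmp.dir` value is an assumption (the notes only say "temporary directory"), and the scratch location is used so that nothing under `$HADOOP_HOME` is overwritten.

```shell
#!/bin/sh
# Write a minimal core-site.xml into a scratch directory for inspection.
conf_dir=$(mktemp -d)

cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- assumed path; use the directory chosen for your installation -->
    <value>/opt/hadoop-3.1.1/tmp</value>
  </property>
</configuration>
EOF

# Show the generated file and count the properties it defines.
cat "$conf_dir/core-site.xml"
grep -c '<property>' "$conf_dir/core-site.xml"
```

The other `*-site.xml` files follow the same `<configuration>` / `<property>` / `<name>` / `<value>` pattern with the property names from the table.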
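
The `jps` check after starting Hadoop can be scripted. The sketch below reads `jps` output on standard input and prints any of the five pseudo-distributed daemons that are absent. The function name `missing_daemons` and the sample PIDs are made up for illustration; on a fully distributed cluster the expected list differs per node.

```shell
#!/bin/sh
# Print the expected Hadoop daemons that do not appear in jps output (stdin).
# missing_daemons is a hypothetical helper written for this illustration.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <ClassName>"; compare the second column exactly,
        # so that SecondaryNameNode does not count as NameNode.
        if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Sample input with made-up PIDs; SecondaryNameNode is not running.
printf '2101 NameNode\n2240 DataNode\n2655 ResourceManager\n2790 NodeManager\n2900 Jps\n' \
    | missing_daemons   # prints SecondaryNameNode
```

On a live node the same function would be used as `jps | missing_daemons`; empty output means all five daemons are up.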
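
To make the HDFS layout of partitioned tables concrete, the sketch below builds the directory that a static partition maps to under the default warehouse root `/user/hive/warehouse`. The function name `partition_path` is hypothetical, and placing both tables in the `classtest` database is an assumption, since the notes do not say which database they belong to.

```shell
#!/bin/sh
# Build the HDFS directory for a Hive partition under the default warehouse.
# Usage: partition_path <database> <table> <key=value>...
# partition_path is a hypothetical helper written for this illustration.
partition_path() {
    db=$1
    table=$2
    shift 2
    path="/user/hive/warehouse/${db}.db/${table}"
    # Each partition column adds one directory level, in declaration order.
    for kv in "$@"; do
        path="${path}/${kv}"
    done
    printf '%s\n' "$path"
}

partition_path classtest cityperson city=xiamen
# prints /user/hive/warehouse/classtest.db/cityperson/city=xiamen

partition_path classtest agentinformation province=fujian city=xiamen
# prints /user/hive/warehouse/classtest.db/agentinformation/province=fujian/city=xiamen
```

With multiple partition columns the directories nest in the order given in `PARTITIONED BY`, which is why `province` appears above `city`.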