Sqoop: Deployment and Usage

Introduction

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

https://sqoop.apache.org/

Traditional application management systems, i.e., applications that interact with relational databases (RDBMS), are one of the sources of big data. The large volumes of data they generate are stored on relational database servers.

As big data storage and analysis components of the Hadoop ecosystem such as MapReduce, Hive, HBase, Cassandra, and Pig emerged, they needed a tool to import and export the big data residing on relational database servers. This is where Sqoop fits into the Hadoop ecosystem: it provides practical interaction between relational database servers and Hadoop HDFS.

Sqoop is a tool for transferring data between Hadoop and relational database servers. It is used both to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back into relational databases.

Sqoop workflow

Deployment

Download the tarball

Download address: http://archive.apache.org/dist/sqoop/

sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
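
If the server has network access, the tarball can also be fetched directly from the archive; a minimal sketch (the 1.4.7 subdirectory on the mirror is assumed):

[hadoop@hadoop001 software]$ wget http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz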

Extract it to the appropriate directory

[hadoop@hadoop001 software]$ tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[hadoop@hadoop001 software]$ mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop-1.4.7
[hadoop@hadoop001 app]$ ln -s /home/hadoop/software/sqoop-1.4.7 /home/hadoop/app/sqoop

Edit conf/sqoop-env.sh

[hadoop@hadoop001 sqoop]$ cd conf/
[hadoop@hadoop001 conf]$ cp sqoop-env-template.sh sqoop-env.sh
[hadoop@hadoop001 conf]$ ll
total 32
-rw-rw-r--. 1 hadoop hadoop 3895 Dec 18 2017 oraoop-site-template.xml
-rwxr-xr-x. 1 hadoop hadoop 1345 Feb 27 23:04 sqoop-env.sh
-rw-rw-r--. 1 hadoop hadoop 1404 Dec 18 2017 sqoop-env-template.cmd
-rwxr-xr-x. 1 hadoop hadoop 1345 Dec 18 2017 sqoop-env-template.sh
-rw-rw-r--. 1 hadoop hadoop 6044 Dec 18 2017 sqoop-site-template.xml
-rw-rw-r--. 1 hadoop hadoop 6044 Dec 18 2017 sqoop-site.xml

Configure the relevant variables (HBase and ZooKeeper are not deployed on this machine)

# Set Hadoop-specific environment variables here.

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/home/hadoop/app/hive

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

Add environment variables

[hadoop@hadoop001 ~]$ vi .bash_profile 
export SQOOP_HOME=/home/hadoop/app/sqoop
export PATH=${SQOOP_HOME}/bin:$PATH
[hadoop@hadoop001 ~]$ source .bash_profile 

Copy the MySQL driver JAR into Sqoop's lib directory

[hadoop@hadoop001 ~]$ cp lib/mysql-connector-java-5.1.47.jar app/sqoop/lib/

Test whether Sqoop can connect to the database

sqoop list-databases --connect jdbc:mysql://hadoop001:3306 --username root --password 123456

Successful connection

[hadoop@hadoop001 lib]$ sqoop list-databases --connect jdbc:mysql://hadoop001:3306 --username root --password '123456'
Warning: /home/hadoop/app/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/app/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2022-02-27 23:54:08,530 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2022-02-27 23:54:08,647 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2022-02-27 23:54:08,832 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Sun Feb 27 23:54:09 GMT 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
information_schema
mysql
performance_schema
sys

Troubleshooting

  • Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils

    [hadoop@hadoop001 ~]$ sqoop list-databases --connect jdbc:mysql://hadoop001:3306 --username root --password '123456'
    Warning: /home/hadoop/app/sqoop/../hbase does not exist! HBase imports will fail.
    Please set $HBASE_HOME to the root of your HBase installation.
    Warning: /home/hadoop/app/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /home/hadoop/app/sqoop/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /home/hadoop/app/sqoop/../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    2022-02-27 23:25:34,723 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
    2022-02-27 23:25:34,838 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    2022-02-27 23:25:35,007 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
    at org.apache.sqoop.manager.MySQLManager.initOptionDefaults(MySQLManager.java:73)
    at org.apache.sqoop.manager.SqlManager.<init>(SqlManager.java:89)
    at com.cloudera.sqoop.manager.SqlManager.<init>(SqlManager.java:33)
    at org.apache.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:51)
    at com.cloudera.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:30)
    at org.apache.sqoop.manager.CatalogQueryManager.<init>(CatalogQueryManager.java:46)
    at com.cloudera.sqoop.manager.CatalogQueryManager.<init>(CatalogQueryManager.java:31)
    at org.apache.sqoop.manager.InformationSchemaManager.<init>(InformationSchemaManager.java:38)
    at com.cloudera.sqoop.manager.InformationSchemaManager.<init>(InformationSchemaManager.java:31)
    at org.apache.sqoop.manager.MySQLManager.<init>(MySQLManager.java:65)
    at org.apache.sqoop.manager.DefaultManagerFactory.accept(DefaultManagerFactory.java:67)
    at org.apache.sqoop.ConnFactory.getManager(ConnFactory.java:184)
    at org.apache.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:272)
    at org.apache.sqoop.tool.ListDatabasesTool.run(ListDatabasesTool.java:44)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
    Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 20 more
    [hadoop@hadoop001 ~]$

    Cause: Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils

    By default, Sqoop 1.4.7 ships only commons-lang3-3.4.jar, in which the StringUtils class lives under the package org/apache/commons/lang3/StringUtils, so running the sqoop command directly raises the error above.

    Solution

    Download the older JAR and drop it into the lib directory under the Sqoop installation.

    Download: commons-lang-2.6.jar
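
    For example, assuming the JAR was downloaded into the user's home directory, copying it into place might look like this:

    [hadoop@hadoop001 ~]$ cp ~/commons-lang-2.6.jar /home/hadoop/app/sqoop/lib/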

  • MySQL authentication failure

    2022-02-27 23:45:27,936 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: java.sql.SQLException: Access denied for user 'root'@'hadoop001' (using password: YES)
    java.lang.RuntimeException: java.sql.SQLException: Access denied for user 'root'@'hadoop001' (using password: YES)

    Check that the password was typed correctly; it is also recommended to wrap the value of --password in single quotes, e.g. --password '123456'.
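
    If the password itself is correct but the login is still rejected, entering it interactively avoids shell quoting problems; a small sketch using the -P prompt instead of --password:

    sqoop list-databases --connect jdbc:mysql://hadoop001:3306 --username root -P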

Usage

[hadoop@hadoop001 ~]$ sqoop help
2022-03-01 16:31:57,832 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]

Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information

See 'sqoop help COMMAND' for information on a specific command.

List databases: list-databases

[hadoop@hadoop001 ~]$ sqoop list-databases --connect jdbc:mysql://hadoop001:3306?useSSL=false --username root --password '123456'
2022-02-28 10:35:17,817 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2022-02-28 10:35:17,972 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2022-02-28 10:35:18,212 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
azkaban
hive
hue
mysql
performance_schema
ruozedata
sys

List tables: list-tables

[hadoop@hadoop001 ~]$ sqoop list-tables --connect jdbc:mysql://hadoop001:3306/mysql?useSSL=false --username root --password '123456'
2022-02-28 10:38:45,712 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2022-02-28 10:38:45,823 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2022-02-28 10:38:46,000 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
columns_priv
db
engine_cost
event
func
general_log
gtid_executed
help_category
help_keyword
help_relation
help_topic
innodb_index_stats
innodb_table_stats
ndb_binlog_index
plugin
proc
procs_priv
proxies_priv
server_cost
servers
slave_master_info
slave_relay_log_info
slave_worker_info
slow_log
tables_priv
time_zone
time_zone_leap_second
time_zone_name
time_zone_transition
time_zone_transition_type
user

Import: import

[hadoop@hadoop001 ~]$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class name
--connection-param-file <properties-file> Specify connection parameters file
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-home <hdir> Override $HADOOP_MAPRED_HOME_ARG
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME_ARG
--help Print usage instructions
--metadata-transaction-isolation-level <isolationlevel> Defines the transaction isolation level for metadata queries.
For more details check java.sql.Connection javadoc
or the JDBC specificaiton
--oracle-escaping-disabled <boolean> Disable the escaping mechanism of the Oracle/OraOop
connection managers
-P Read password from console
--password <password> Set authentication password
--password-alias <password-alias> Credential provider password alias
--password-file <password-file> Set authentication password file path
--relaxed-isolation Use read-uncommitted isolation for imports
--skip-dist-cache Skip copying jars to distributed cache
--temporary-rootdir <rootdir> Defines the temporary root directory for the import
--throw-on-error Rethrow a RuntimeException on error occurred during the job
--username <username> Set authentication username
--verbose Print more information while working

Import control arguments:
--append Imports data in append mode
--as-avrodatafile Imports data to Avro data files
--as-parquetfile Imports data to Parquet files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--autoreset-to-one-mapper Reset the number of mappers to one mapper
if no split key available
--boundary-query <statement> Set boundary query for retrieving max and min value of
the primary key
--columns <col,col,col...> Columns to import from table
--compression-codec <codec> Compression codec to use for import
--delete-target-dir Imports data in delete mode
--direct Use direct import fast path
--direct-split-size <n> Split the input stream every 'n' bytes when importing in direct mode
-e,--query <statement> Import results of SQL 'statement'
--fetch-size <n> Set number 'n' of rows to fetch from the database when more rows are needed
--inline-lob-limit <n> Set the maximum size for an inline LOB
-m,--num-mappers <n> Use 'n' map tasks to import in parallel
--mapreduce-job-name <name> Set name for generated mapreduce job
--merge-key <column> Key column to use to join results
--split-by <column-name> Column of the table used to split work units
--split-limit <size> Upper Limit of rows per split for split columns
of Date/Time/Timestamp and integer types. For date or timestamp
fields it is calculated in seconds. split-limit should be
greater than 0
--table <table-name> Table to read
--target-dir <dir> HDFS plain table destination
--validate Validate the copy using the configured validator
--validation-failurehandler <validation-failurehandler> Fully qualified class name for ValidationFailureHandler
--validation-threshold <validation-threshold> Fully qualified class name for ValidationThreshold
--validator <validator> Fully qualified class name for the Validator
--warehouse-dir <dir> HDFS parent for table destination
--where <where clause> WHERE clause to use during import
-z,--compress Enable compression

Incremental import arguments:
--check-column <column> Source column to check for incremental change
--incremental <import-type> Define an incremental import of type 'append' or 'lastmodified'
--last-value <value> Last imported value in the incremental check column

Output line formatting arguments:
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \
optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character

Input parsing arguments:
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line char
--input-optionally-enclosed-by <char> Sets a field enclosing character

Hive arguments:
--create-hive-table Fail if the target hive table exists
--external-table-dir <hdfs path> Sets where the external table is in HDFS
--hive-database <database-name> Sets the database name to use when importing to hive
--hive-delims-replacement <arg> Replace Hive record \0x01 and row delimiters (\n\r)
from imported string fields with user-defined string
--hive-drop-import-delims Drop Hive record \0x01 and row delimiters (\n\r) from imported string fields
--hive-home <dir> Override $HIVE_HOME
--hive-import Import tables into Hive (Uses Hive's default delimiters if none are set.)
--hive-overwrite Overwrite existing data in the Hive table
--hive-partition-key <partition-key> Sets the partition key to use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to use when importing to hive
--hive-table <table-name> Sets the table name to use when importing to hive
--map-column-hive <arg> Override mapping for specific column to hive types.

HBase arguments:
--column-family <family> Sets the target column family for the import
--hbase-bulkload Enables HBase bulk loading
--hbase-create-table If specified, create missing HBase tables
--hbase-row-key <col> Specifies which input column to use as the row key
--hbase-table <table> Import to <table> in HBase

HCatalog arguments:
--hcatalog-database <arg> HCatalog database name
--hcatalog-home <hdir> Override $HCAT_HOME
--hcatalog-partition-keys <partition-key> Sets the partition keys to use when importing to hive
--hcatalog-partition-values <partition-value> Sets the partition values to use when importing to hive
--hcatalog-table <arg> HCatalog table name
--hive-home <dir> Override $HIVE_HOME
--hive-partition-key <partition-key> Sets the partition key to use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to use when importing to hive
--map-column-hive <arg> Override mapping for specific column to hive types.

HCatalog import specific options:
--create-hcatalog-table Create HCatalog before import
--drop-and-create-hcatalog-table Drop and Create HCatalog before import
--hcatalog-storage-stanza <arg> HCatalog storage stanza for table creation

Accumulo arguments:
--accumulo-batch-size <size> Batch size in bytes
--accumulo-column-family <family> Sets the target column family for the import
--accumulo-create-table If specified, create missing Accumulo tables
--accumulo-instance <instance> Accumulo instance name.
--accumulo-max-latency <latency> Max write latency in milliseconds
--accumulo-password <password> Accumulo password.
--accumulo-row-key <col> Specifies which input column to use as the row key
--accumulo-table <table> Import to <table> in Accumulo
--accumulo-user <user> Accumulo user name.
--accumulo-visibility <vis> Visibility token to be applied to all rows imported
--accumulo-zookeepers <zookeepers> Comma-separated list of zookeepers (host:port)

Code generation arguments:
--bindir <dir> Output directory for compiled objects
--class-name <name> Sets the generated class name. This overrides --package-name.
When combined with --jar-file, sets the input class.
--escape-mapping-column-names <boolean> Disable special characters escaping in column names
--input-null-non-string <null-str> Input null non-string representation
--input-null-string <null-str> Input null string representation
--jar-file <file> Disable code generation; use specified jar
--map-column-java <arg> Override mapping for specific columns to java types
--null-non-string <null-str> Null non-string representation
--null-string <null-str> Null string representation
--outdir <dir> Output directory for generated code
--package-name <name> Put auto-generated classes in this package

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]


At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.

Notes

  1. The default import path is under /user/{username}/. My Hadoop user is hadoop, so the target path here is /user/hadoop/EMP_COLUMN.

  2. By default the import produces 4 output files, because 4 map tasks run in parallel.

  3. When the table has no primary key, you must either specify --split-by or set the parallelism to -m 1, because with a parallelism greater than 1 Sqoop needs the specified column or the primary key to work out how many records are assigned to each task.

  4. When querying with -e or --query, the WHERE clause of the statement must include '$CONDITIONS'; a combined sketch using --query with --split-by follows this list.

    Import failed: java.io.IOException: Query [SELECT * FROM emp WHERE EMPNO>=7900] must contain '$CONDITIONS' in WHERE clause.
    -e "SELECT * FROM emp WHERE EMPNO>=7900"
    should be written as:
    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND \$CONDITIONS"

    Also, -e/--query cannot be used together with --where, --columns, or --table.
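
Putting notes 3 and 4 together, a free-form query import might look like the sketch below; the query, split column, and EMP_QUERY target directory are illustrative, not taken from the original example:

sqoop import \
--connect jdbc:mysql://hadoop001:3306/ruozedata \
--username root \
--password '123456' \
--delete-target-dir \
--query "SELECT empno, ename, sal FROM emp WHERE sal > 1000 AND \$CONDITIONS" \
--split-by empno \
--target-dir EMP_QUERY \
-m 2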

Hadoop example

The MySQL table:

mysql> use ruozedata;
Database changed
mysql> select * from emp;
+-------+--------+-----------+------+---------------------+---------+---------+--------+
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
+-------+--------+-----------+------+---------------------+---------+---------+--------+
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 00:00:00 | 800.00 | NULL | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 00:00:00 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 00:00:00 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 00:00:00 | 2975.00 | NULL | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 00:00:00 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 00:00:00 | 2850.00 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 00:00:00 | 2450.00 | NULL | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 00:00:00 | 3000.00 | NULL | 20 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 00:00:00 | 5000.00 | NULL | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 00:00:00 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1983-01-12 00:00:00 | 1100.00 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 00:00:00 | 950.00 | NULL | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 00:00:00 | 3000.00 | NULL | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 00:00:00 | 1300.00 | NULL | 10 |
+-------+--------+-----------+------+---------------------+---------+---------+--------+
14 rows in set (0.02 sec)

Import the emp table of the ruozedata database into HDFS under /user/{username}

The import command:

sqoop import \
--connect jdbc:mysql://hadoop001:3306/ruozedata \
--username root \
--password '123456' \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--mapreduce-job-name EmpFromMySQL2HDFS \
--table emp \
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \
--where 'SAL>2000' \
--target-dir EMP_COLUMN \
-m 1

Explanation:

  • --connect: JDBC URL of the database to connect to
  • --username: database username
  • --password: database password
  • --delete-target-dir: delete the target directory first if it already exists
  • --columns: columns to import; if omitted, all columns are imported
  • --mapreduce-job-name: name for the generated MapReduce job
  • --table emp: the table to import
  • --null-string: replacement value written when a string column is NULL
  • --null-non-string: replacement value written when a non-string column is NULL
  • --fields-terminated-by: field separator in the output files
  • --where: filter condition
  • --target-dir: output directory
  • -m: parallelism, i.e. the number of map tasks
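
Once the job finishes, the result can be verified directly in HDFS; a quick check might look like this (the part file name assumes the single-mapper run above):

[hadoop@hadoop001 ~]$ hdfs dfs -ls /user/hadoop/EMP_COLUMN
[hadoop@hadoop001 ~]$ hdfs dfs -cat /user/hadoop/EMP_COLUMN/part-m-00000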

Hive example

sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--username root \
--password '123456' \
--delete-target-dir \
--hive-database ruozedata_hive \
--hive-import \
--hive-overwrite \
--hive-table emp_column \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--mapreduce-job-name EmpFromMySQL2Hive \
--table emp \
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \
--hive-partition-key 'day' \
--hive-partition-value 'yyyyMMdd'

Explanation:

  • --hive-database: the Hive database to import into

  • --hive-import: import into Hive

  • --hive-overwrite: overwrite existing data in the Hive table

  • --hive-table: the Hive table to import into

    For a partitioned table, two more options apply:

  • --hive-partition-key: the partition key column

  • --hive-partition-value: the partition value
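
After the import completes, the data can be spot-checked from the Hive side; a minimal sketch (database, table, and partition value are taken from the command above, where 'yyyyMMdd' is only a placeholder):

hive -e "SELECT * FROM ruozedata_hive.emp_column WHERE day='yyyyMMdd' LIMIT 10;"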

Export: export

[hadoop@hadoop001 ~]$ sqoop help export
Warning: /home/hadoop/app/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/app/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2022-02-28 11:50:42,057 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop export [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class name
--connection-param-file <properties-file> Specify connection parameters file
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-home <hdir> Override $HADOOP_MAPRED_HOME_ARG
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME_ARG
--help Print usage instructions
--metadata-transaction-isolation-level <isolationlevel> Defines the transaction isolation level for metadata queries.
For more details check java.sql.Connection javadoc
or the JDBC specificaiton
--oracle-escaping-disabled <boolean> Disable the escaping mechanism of the Oracle/OraOop connection managers
-P Read password from console
--password <password> Set authentication password
--password-alias <password-alias> Credential provider password alias
--password-file <password-file> Set authentication password file path
--relaxed-isolation Use read-uncommitted isolation for imports
--skip-dist-cache Skip copying jars to distributed cache
--temporary-rootdir <rootdir> Defines the temporary root directory for the import
--throw-on-error Rethrow a RuntimeException on error occurred during the job
--username <username> Set authentication username
--verbose Print more information while working

Export control arguments:
--batch Indicates underlying statements to be executed in batch mode
--call <arg> Populate the table using this stored procedure (one call per row)
--clear-staging-table Indicates that any data in staging table can be deleted
--columns <col,col,col...> Columns to export to table
--direct Use direct export fast path
--export-dir <dir> HDFS source path for the export
-m,--num-mappers <n> Use 'n' map tasks to export in parallel
--mapreduce-job-name <name> Set name for generated mapreduce job
--staging-table <table-name> Intermediate staging table
--table <table-name> Table to populate
--update-key <key> Update records by specified key column
--update-mode <mode> Specifies how updates are performed when new rows are
found with non-matching keys in database
--validate Validate the copy using the configured validator
--validation-failurehandler <validation-failurehandler> Fully qualified class name for ValidationFailureHandler
--validation-threshold <validation-threshold> Fully qualified class name for ValidationThreshold
--validator <validator> Fully qualified class name for the Validator

Input parsing arguments:
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line char
--input-optionally-enclosed-by <char> Sets a field enclosing character

Output line formatting arguments:
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \
optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character

Code generation arguments:
--bindir <dir> Output directory for compiled objects
--class-name <name> Sets the generated class name. This overrides --package-name. When
combined with --jar-file, sets the input class.
--escape-mapping-column-names <boolean> Disable special characters escaping in column names
--input-null-non-string <null-str> Input null non-string representation
--input-null-string <null-str> Input null string representation
--jar-file <file> Disable code generation; use specified jar
--map-column-java <arg> Override mapping for specific columns to java types
--null-non-string <null-str> Null non-string representation
--null-string <null-str> Null string representation
--outdir <dir> Output directory for generated code
--package-name <name> Put auto-generated classes in this package

HCatalog arguments:
--hcatalog-database <arg> HCatalog database name
--hcatalog-home <hdir> Override $HCAT_HOME
--hcatalog-partition-keys <partition-key> Sets the partition keys to use when importing to hive
--hcatalog-partition-values <partition-value> Sets the partition values to use when importing to hive
--hcatalog-table <arg> HCatalog table name
--hive-home <dir> Override $HIVE_HOME
--hive-partition-key <partition-key> Sets the partition key to use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to use when importing to hive
--map-column-hive <arg> Override mapping for specific column to hive types.

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]


At minimum, you must specify --connect, --export-dir, and --table

Export to the emp2 table

sqoop export \
--connect jdbc:mysql://hadoop001:3306/ruozedata \
--username root \
--password '123456' \
--table emp2 \
--fields-terminated-by '\t' \
--export-dir /user/hadoop/emp
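
Note that export does not create the target table: emp2 must already exist in MySQL with a schema matching the exported fields. One way to prepare it, sketched here with the eval tool described later, is to clone the structure of emp (the CREATE TABLE ... LIKE statement is illustrative):

sqoop eval \
--connect jdbc:mysql://hadoop001:3306/ruozedata \
--username root \
--password '123456' \
--query "CREATE TABLE emp2 LIKE emp"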

Jobs: job

[hadoop@hadoop001 ~]$ sqoop job --help
Warning: /home/hadoop/app/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/app/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2022-03-01 15:55:19,426 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop job [GENERIC-ARGS] [JOB-ARGS] [-- [<tool-name>] [TOOL-ARGS]]

Job management arguments:
--create <job-id> Create a new saved job
--delete <job-id> Delete a saved job
--exec <job-id> Run a saved job
--help Print usage instructions
--list List saved jobs
--meta-connect <jdbc-uri> Specify JDBC connect string for the
metastore
--show <job-id> Show the parameters for a saved job
--verbose Print more information while working

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

The syntax of a Sqoop job is:

sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Note that there must be a space between -- and the subtool-name.

Create a job (--create)

sqoop job --create myjob  \
-- import \
--connect jdbc:mysql://hadoop001:3306/ruozedata \
--username root \
--password '123456' \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--mapreduce-job-name EmpFromMySQL2HDFS \
--table emp \
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \
--where 'SAL>2000' \
--target-dir EMP_COLUMN \
-m 1

Problem:

Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/json/JSONObject

Cause: Sqoop is missing the java-json.jar package.

Solution: download java-json.jar.zip and add the extracted java-json.jar to the sqoop/lib directory.
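
For example, assuming the archive was downloaded and unzipped in the home directory, copying the JAR into place might look like this:

[hadoop@hadoop001 ~]$ cp ~/java-json.jar /home/hadoop/app/sqoop/lib/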

List saved jobs (--list)

The --list argument lists the saved Sqoop jobs.

[hadoop@hadoop001 ~]$ sqoop job --list
2022-03-01 16:20:34,767 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Available jobs:
myjob

Inspect a job (--show)

The --show argument inspects a specific saved job and displays its details.

[hadoop@hadoop001 ~]$ sqoop job --show myjob
2022-03-01 16:21:05,716 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Enter password:
Job: myjob
Tool: import
Options:
----------------------------
verbose = false
hcatalog.drop.and.create.table = false
db.connect.string = jdbc:mysql://hadoop001:3306/ruozedata
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
mainframe.input.dataset.type = p
hbase.create.table = false
split.limit = null
null.string =
...

Run a job (--exec)

The --exec option runs a saved job.

sqoop job --exec myjob

Delete a job (--delete)

sqoop job --delete myjob

The eval tool

The eval tool lets users run ad-hoc SQL queries against the database server and preview the results in the console, so they can see in advance what table data an import would bring in. With eval, any kind of SQL statement can be evaluated, whether DDL or DML.

[hadoop@hadoop001 ~]$ sqoop help eval
2022-03-01 16:38:37,514 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop eval [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class name
--connection-param-file <properties-file> Specify connection parameters file
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-home <hdir> Override $HADOOP_MAPRED_HOME_ARG
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME_ARG
--help Print usage instructions
--metadata-transaction-isolation-level <isolationlevel> Defines the transaction isolation level for metadata queries.
For more details check java.sql.Connection javadoc or the JDBC
specificaiton
--oracle-escaping-disabled <boolean> Disable the escaping mechanism of the Oracle/OraOop
connection managers
-P Read password from console
--password <password> Set authentication password
--password-alias <password-alias> Credential provider password alias
--password-file <password-file> Set authentication password file path
--relaxed-isolation Use read-uncommitted isolation for imports
--skip-dist-cache Skip copying jars to distributed cache
--temporary-rootdir <rootdir> Defines the temporary root directory for the import
--throw-on-error Rethrow a RuntimeException on error occurred during the job
--username <username> Set authentication username
--verbose Print more information while working

SQL evaluation arguments:
-e,--query <statement> Execute 'statement' in SQL and exit

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

Example

[hadoop@hadoop001 ~]$ sqoop eval --connect jdbc:mysql://hadoop001:3306/ruozedata --username root --password '123456' --query "SELECT * FROM emp where deptno=10"
2022-03-01 16:45:12,146 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2022-03-01 16:45:12,256 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2022-03-01 16:45:12,426 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Tue Mar 01 16:45:12 GMT 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
----------------------------------------------------------------------------------------------
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
----------------------------------------------------------------------------------------------
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 00:00:00.0 | 2450.00 | (null) | 10 |
| 7839 | KING | PRESIDENT | (null) | 1981-11-17 00:00:00.0 | 5000.00 | (null) | 10 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 00:00:00.0 | 1300.00 | (null) | 10 |
----------------------------------------------------------------------------------------------
[hadoop@hadoop001 ~]$ sqoop eval --connect jdbc:mysql://hadoop001:3306/ruozedata --username root --password 'ruozedata001' --query "INSERT INTO emp VALUES(7934,'KKK','CLERK',7782,'1985-01-20 00:00:00.0',1400,200,10)"
2022-03-01 16:49:02,886 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2022-03-01 16:49:03,001 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2022-03-01 16:49:03,169 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Tue Mar 01 16:49:03 GMT 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
2022-03-01 16:49:03,566 INFO tool.EvalSqlTool: 1 row(s) updated.
[hadoop@hadoop001 ~]$ sqoop eval --connect jdbc:mysql://hadoop001:3306/ruozedata --username root --password 'ruozedata001' --query "SELECT * FROM emp where deptno=10"
2022-03-01 16:49:28,104 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2022-03-01 16:49:28,239 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2022-03-01 16:49:28,391 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Tue Mar 01 16:49:28 GMT 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
----------------------------------------------------------------------------------------------
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
----------------------------------------------------------------------------------------------
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 00:00:00.0 | 2450.00 | (null) | 10 |
| 7839 | KING | PRESIDENT | (null) | 1981-11-17 00:00:00.0 | 5000.00 | (null) | 10 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 00:00:00.0 | 1300.00 | (null) | 10 |
| 7934 | KKK | CLERK | 7782 | 1985-01-20 00:00:00.0 | 1400.00 | 200.00 | 10 |
----------------------------------------------------------------------------------------------