Overview
Hadoop Archives are Hadoop's archive format. A Hadoop Archive is a special-format archive that maps to a file system directory, and a Hadoop Archive file always carries the .har extension.
A Hadoop Archive (har file) directory contains:
- metadata, in the form of _index and _masterindex files
- data, in part-* files

The _index file records the names of the archived files and their locations within the part files.
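For orientation, a freshly created archive directory typically looks roughly like the sketch below (foo.har is only a placeholder name; larger inputs may produce more than one part file):

foo.har/_masterindex   # pointers into _index
foo.har/_index         # archived file names and offsets into the part files
foo.har/part-0         # the concatenated file data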
Use case
HDFS is not good at storing small files: each file occupies at least one block, and the metadata for every file and block is held in the NameNode's memory, so a large number of small files will eat up a large amount of NameNode memory.
Hadoop Archives address this problem effectively: they pack many files into a single archive, each archived file remains transparently accessible, and the archive can be used as MapReduce input. (For MapReduce itself, however, a har file brings no benefit: it behaves just like a directory, so the small files are still not merged into a single split — it remains one split per small file.)
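To get a feel for the scale, a commonly cited rule of thumb (an estimate, not a hard number) is that every file, directory and block object costs the NameNode on the order of 150 bytes of heap. Under that assumption, 10 million small files, each contributing at least one file object and one block object, take roughly 10,000,000 × 2 × 150 B ≈ 3 GB of NameNode memory; packed into a single archive, the same data leaves only the archive's own handful of entries in the namespace (the blocks of its part files are still tracked, but the per-small-file overhead is gone).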
Creating an archive
Creating an archive runs as a MapReduce job, so you need a MapReduce cluster (i.e. YARN must be running) to execute it.
Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>
Parameters
- -archiveName <name>.har: the name of the archive to create; it must end with the .har extension
- -p <parent>: the parent directory that the source paths are taken relative to
- -r <replication factor>: the desired replication factor; defaults to 3 if not set
- <src>*: one or more source paths to archive, relative to the parent
- <dest>: the destination directory where the har file is written
Example:
hadoop archive -archiveName foo.har -p /foo/bar -r 3 dir1 dir2 /user/hadoop
/foo/bar is the parent directory of the two src paths dir1 and dir2, so the command above archives /foo/bar/dir1 and /foo/bar/dir2 into /user/hadoop/foo.har.
If you want to archive the single directory /foo/bar itself, you can omit the src paths:
hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir
Notes
- Creating an archive runs as a MapReduce job, so you need a MapReduce cluster (YARN) to run it.
- Archiving does not delete the source files. If you want to delete them (to reclaim namespace), you have to do so manually; see the sketch after this list.
- If you specify source files that are in an encryption zone, they are decrypted and written into the archive. If the har file is not in an encryption zone, it is stored in clear (decrypted) form; if the har file is in an encryption zone, it is stored in encrypted form.
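A minimal sketch of the manual cleanup, assuming the foo.har example above was created successfully (always verify the archive before removing anything):

hdfs dfs -ls -R har:///user/hadoop/foo.har    # check that the archive really contains the data
hdfs dfs -rm -r /foo/bar/dir1 /foo/bar/dir2   # then delete the original source directories yourself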
Looking up files in an archive
An archive exposes itself as a file system layer, so all the fs shell commands work on archives, just with a different URI.
The URI for a Hadoop Archive is
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided, the underlying file system is assumed. In that case the URI looks like
har:///archivepath/fileinarchive
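For example, with a hypothetical NameNode at nn1.example.com:8020 (placeholder values, not from this article), the two forms would be used like this:

hdfs dfs -ls har://hdfs-nn1.example.com:8020/user/zoo/foo.har/dir1
hdfs dfs -ls har:///user/zoo/foo.har/dir1    # same listing, relying on the default file system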
Note: archives are immutable, so rename, delete, and create all return an error.
Unarchiving
Since all the fs shell commands work transparently on archives, unarchiving is just a matter of copying.
To unarchive sequentially:
hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
To unarchive in parallel, use DistCp:
hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
Hadoop Archives and MapReduce
Using a Hadoop Archive in MapReduce is no different from using the default file system for input data. If you have a Hadoop Archive stored at /user/zoo/foo.har in HDFS, you can simply pass the path har:///user/zoo/foo.har to your MapReduce program as the input. Because a Hadoop Archive is exposed as a file system, MapReduce can use all of the logical input files inside the archive as its input source.
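As a quick sketch, the WordCount example that ships with Hadoop can be pointed at the archive directly; the examples jar path and the output directory below are assumptions, so adjust them to your installation:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out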
My own example
Prepare the files
[hadoop@hadoop001 data]$ hdfs dfs -ls -R /user/hadoop/input
-rw-r--r-- 1 hadoop supergroup 11 2021-12-19 15:54 /user/hadoop/input/a.log
-rw-r--r-- 1 hadoop supergroup 18 2021-12-19 15:54 /user/hadoop/input/b.log
-rw-r--r-- 1 hadoop supergroup 11 2021-12-19 15:54 /user/hadoop/input/c.log
drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 /user/hadoop/input/d
-rw-r--r-- 1 hadoop supergroup 4 2021-12-19 15:54 /user/hadoop/input/d/e.log
Create the har file
[hadoop@hadoop001 data]$ hadoop archive -archiveName input.har -p /user/hadoop/input /user/hadoop
2021-12-19 15:56:44,393 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-19 15:56:45,593 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-19 15:56:46,217 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-19 15:56:46,258 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-19 15:56:46,685 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1639763497373_0008
2021-12-19 15:56:47,302 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-19 15:56:47,571 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639763497373_0008
2021-12-19 15:56:47,578 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-19 15:56:47,895 INFO conf.Configuration: resource-types.xml not found
2021-12-19 15:56:47,895 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-19 15:56:48,044 INFO impl.YarnClientImpl: Submitted application application_1639763497373_0008
2021-12-19 15:56:48,119 INFO mapreduce.Job: The url to track the job: http://hadoop001:8088/proxy/application_1639763497373_0008/
2021-12-19 15:56:48,124 INFO mapreduce.Job: Running job: job_1639763497373_0008
2021-12-19 15:56:58,359 INFO mapreduce.Job: Job job_1639763497373_0008 running in uber mode : false
2021-12-19 15:56:58,361 INFO mapreduce.Job: map 0% reduce 0%
2021-12-19 15:57:05,437 INFO mapreduce.Job: map 100% reduce 0%
2021-12-19 15:57:12,484 INFO mapreduce.Job: map 100% reduce 100%
2021-12-19 15:57:13,506 INFO mapreduce.Job: Job job_1639763497373_0008 completed successfully
2021-12-19 15:57:13,611 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=425
FILE: Number of bytes written=473491
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=581
HDFS: Number of bytes written=450
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=12
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4796
Total time spent by all reduces in occupied slots (ms)=4103
Total time spent by all map tasks (ms)=4796
Total time spent by all reduce tasks (ms)=4103
Total vcore-milliseconds taken by all map tasks=4796
Total vcore-milliseconds taken by all reduce tasks=4103
Total megabyte-milliseconds taken by all map tasks=4911104
Total megabyte-milliseconds taken by all reduce tasks=4201472
Map-Reduce Framework
Map input records=6
Map output records=6
Map output bytes=407
Map output materialized bytes=425
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=6
Reduce shuffle bytes=425
Reduce input records=6
Reduce output records=0
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=181
CPU time spent (ms)=1520
Physical memory (bytes) snapshot=322760704
Virtual memory (bytes) snapshot=5437816832
Total committed heap usage (bytes)=170004480
Peak Map Physical memory (bytes)=212164608
Peak Map Virtual memory (bytes)=2717405184
Peak Reduce Physical memory (bytes)=110596096
Peak Reduce Virtual memory (bytes)=2720411648
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=419
File Output Format Counters
Bytes Written=0
[hadoop@hadoop001 data]$ hdfs dfs -ls /user/hadoop/
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 /user/hadoop/input
drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:57 /user/hadoop/input.har
Look at the files that make up the archive
[hadoop@hadoop001 data]$ hdfs dfs -cat /user/hadoop/input.har
cat: `/user/hadoop/input.har': Is a directory
[hadoop@hadoop001 data]$ hdfs dfs -ls /user/hadoop/input.har
Found 4 items
-rw-r--r-- 1 hadoop supergroup 0 2021-12-19 15:57 /user/hadoop/input.har/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 383 2021-12-19 15:57 /user/hadoop/input.har/_index
-rw-r--r-- 3 hadoop supergroup 23 2021-12-19 15:57 /user/hadoop/input.har/_masterindex
-rw-r--r-- 3 hadoop supergroup 44 2021-12-19 15:57 /user/hadoop/input.har/part-0
View the contents of the har archive with the hdfs shell
[hadoop@hadoop001 data]$ hdfs dfs -ls har:///user/hadoop/input.har
Found 4 items
-rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/a.log
-rw-r--r-- 3 hadoop supergroup 18 2021-12-19 15:54 har:///user/hadoop/input.har/b.log
-rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/c.log
drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 har:///user/hadoop/input.har/d
[hadoop@hadoop001 data]$ hdfs dfs -ls -R har:///user/hadoop/input.har
2021-12-19 16:03:48,906 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/a.log
-rw-r--r-- 3 hadoop supergroup 18 2021-12-19 15:54 har:///user/hadoop/input.har/b.log
-rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/c.log
drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 har:///user/hadoop/input.har/d
-rw-r--r-- 3 hadoop supergroup 4 2021-12-19 15:54 har:///user/hadoop/input.har/d/e.log