Hadoop Archives

Overview

A Hadoop Archive is a special-format archive that maps to a file system directory. A Hadoop Archive always has a .har extension.

A Hadoop Archive (har file) directory contains:

  • metadata (in the form of _index and _masterindex files)

  • the data as part-* files

The _index file contains the names of the archived files and their locations within the part files.


Use Case

HDFS is not good at storing small files: every file occupies at least one block, and each block's metadata takes up memory on the NameNode, so a large number of small files will eat up a large amount of NameNode memory.
Hadoop Archives deal with this problem effectively. An archive packs many files into a single file, each original file remains transparently accessible after archiving, and the archive can be used as MapReduce input. (For MapReduce itself, however, this does not help: a har file behaves like a directory, so the small files are still not merged into one split; each small file still gets its own split.)

Creating an Archive

Creating an archive is a Map/Reduce job, so you need a MapReduce cluster to run it (start YARN).

Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>

Parameters

  • -archiveName <name>.har: the name of the archive file; it must end with .har
  • -p <parent>: the parent directory that the source paths are relative to
  • -r <replication factor>: the desired replication factor; defaults to 3 if not set
  • <src>*: one or more source paths to archive, relative to the parent
  • <dest>: the destination directory where the har file is created

Example:

hadoop archive -archiveName foo.har -p /foo/bar -r 3 dir1 dir2 /user/hadoop

/foo/bar is the parent directory of the two src paths dir1 and dir2, so the command above archives /foo/bar/dir1 and /foo/bar/dir2 into /user/hadoop/foo.har.

If you want to archive the single directory /foo/bar itself, you can omit the src arguments:

hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir

Additional Notes

  1. Creating an archive is a Map/Reduce job, so you need a MapReduce cluster to run it (start YARN).
  2. Archiving does not delete the source files. If you want to delete them (to reclaim namespace), you have to do that yourself; see the sketch after this list.
  3. If the specified source files are in an encryption zone, they are decrypted and written into the archive. If the har file is not in an encryption zone, it is stored in decrypted (clear-text) form. If the har file is in an encryption zone, it is stored in encrypted form.
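For example, once the foo.har archive above has been created and verified, the original source directories could be removed with a plain fs shell delete (the paths are the ones from the earlier example):

hdfs dfs -rm -r /foo/bar/dir1 /foo/bar/dir2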

Viewing Files in an Archive

An archive exposes itself as a file system layer, so all fs shell commands work on archives, just with a different URI.

The URI for a Hadoop Archive is:

har://scheme-hostname:port/archivepath/fileinarchive

If no scheme is provided, the underlying file system is assumed. In that case the URI looks like:

har:///archivepath/fileinarchive

Note: archives are immutable, so rename, delete, and create all return an error.
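Read operations still work as usual. For instance, to list a directory inside the foo.har archive created earlier (the NameNode host and port here are hypothetical; the schemeless har:/// form works just as well on the default file system):

hdfs dfs -ls har://hdfs-namenode:8020/user/zoo/foo.har/dir1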

How to Unarchive

Since all fs shell commands work transparently on archives, unarchiving is just a matter of copying.

To unarchive sequentially:

hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

To unarchive in parallel, use DistCp:

hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

Hadoop Archives and MapReduce

In MapReduce, a Hadoop Archive can be used as an input file system just like the default file system. If you have a Hadoop Archive stored in HDFS at /user/zoo/foo.har, you can pass the path har:///user/zoo/foo.har as input to your MapReduce program.
Since a Hadoop Archive is exposed as a file system, MapReduce is able to use all the logical input files inside the archive as input.
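As a minimal sketch, assuming the foo.har archive from above and the examples jar that ships with Hadoop (the output path is made up for illustration):

# run the bundled wordcount job over the logical files inside the archive
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out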

Worked Example

  1. Prepare the files

    [hadoop@hadoop001 data]$ hdfs dfs -ls -R /user/hadoop/input
    -rw-r--r-- 1 hadoop supergroup 11 2021-12-19 15:54 /user/hadoop/input/a.log
    -rw-r--r-- 1 hadoop supergroup 18 2021-12-19 15:54 /user/hadoop/input/b.log
    -rw-r--r-- 1 hadoop supergroup 11 2021-12-19 15:54 /user/hadoop/input/c.log
    drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 /user/hadoop/input/d
    -rw-r--r-- 1 hadoop supergroup 4 2021-12-19 15:54 /user/hadoop/input/d/e.log
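
    To reproduce this layout, something along these lines would do (assuming a.log, b.log, c.log, and e.log exist as local files):

    hdfs dfs -mkdir -p /user/hadoop/input/d
    hdfs dfs -put a.log b.log c.log /user/hadoop/input
    hdfs dfs -put e.log /user/hadoop/input/d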

  2. Create the har file

    [hadoop@hadoop001 data]$ hadoop archive -archiveName input.har -p /user/hadoop/input /user/hadoop
    2021-12-19 15:56:44,393 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2021-12-19 15:56:45,593 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    2021-12-19 15:56:46,217 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    2021-12-19 15:56:46,258 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    2021-12-19 15:56:46,685 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1639763497373_0008
    2021-12-19 15:56:47,302 INFO mapreduce.JobSubmitter: number of splits:1
    2021-12-19 15:56:47,571 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639763497373_0008
    2021-12-19 15:56:47,578 INFO mapreduce.JobSubmitter: Executing with tokens: []
    2021-12-19 15:56:47,895 INFO conf.Configuration: resource-types.xml not found
    2021-12-19 15:56:47,895 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
    2021-12-19 15:56:48,044 INFO impl.YarnClientImpl: Submitted application application_1639763497373_0008
    2021-12-19 15:56:48,119 INFO mapreduce.Job: The url to track the job: http://hadoop001:8088/proxy/application_1639763497373_0008/
    2021-12-19 15:56:48,124 INFO mapreduce.Job: Running job: job_1639763497373_0008
    2021-12-19 15:56:58,359 INFO mapreduce.Job: Job job_1639763497373_0008 running in uber mode : false
    2021-12-19 15:56:58,361 INFO mapreduce.Job: map 0% reduce 0%
    2021-12-19 15:57:05,437 INFO mapreduce.Job: map 100% reduce 0%
    2021-12-19 15:57:12,484 INFO mapreduce.Job: map 100% reduce 100%
    2021-12-19 15:57:13,506 INFO mapreduce.Job: Job job_1639763497373_0008 completed successfully
    2021-12-19 15:57:13,611 INFO mapreduce.Job: Counters: 54
    File System Counters
    FILE: Number of bytes read=425
    FILE: Number of bytes written=473491
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=581
    HDFS: Number of bytes written=450
    HDFS: Number of read operations=24
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=12
    HDFS: Number of bytes read erasure-coded=0
    Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=4796
    Total time spent by all reduces in occupied slots (ms)=4103
    Total time spent by all map tasks (ms)=4796
    Total time spent by all reduce tasks (ms)=4103
    Total vcore-milliseconds taken by all map tasks=4796
    Total vcore-milliseconds taken by all reduce tasks=4103
    Total megabyte-milliseconds taken by all map tasks=4911104
    Total megabyte-milliseconds taken by all reduce tasks=4201472
    Map-Reduce Framework
    Map input records=6
    Map output records=6
    Map output bytes=407
    Map output materialized bytes=425
    Input split bytes=118
    Combine input records=0
    Combine output records=0
    Reduce input groups=6
    Reduce shuffle bytes=425
    Reduce input records=6
    Reduce output records=0
    Spilled Records=12
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=181
    CPU time spent (ms)=1520
    Physical memory (bytes) snapshot=322760704
    Virtual memory (bytes) snapshot=5437816832
    Total committed heap usage (bytes)=170004480
    Peak Map Physical memory (bytes)=212164608
    Peak Map Virtual memory (bytes)=2717405184
    Peak Reduce Physical memory (bytes)=110596096
    Peak Reduce Virtual memory (bytes)=2720411648
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=419
    File Output Format Counters
    Bytes Written=0
    [hadoop@hadoop001 data]$ hdfs dfs -ls /user/hadoop/
    Found 2 items
    drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 /user/hadoop/input
    drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:57 /user/hadoop/input.har
  3. Examine the files that make up the archive

    [hadoop@hadoop001 data]$ hdfs dfs -cat /user/hadoop/input.har
    cat: `/user/hadoop/input.har': Is a directory
    [hadoop@hadoop001 data]$ hdfs dfs -ls /user/hadoop/input.har
    Found 4 items
    -rw-r--r-- 1 hadoop supergroup 0 2021-12-19 15:57 /user/hadoop/input.har/_SUCCESS
    -rw-r--r-- 3 hadoop supergroup 383 2021-12-19 15:57 /user/hadoop/input.har/_index
    -rw-r--r-- 3 hadoop supergroup 23 2021-12-19 15:57 /user/hadoop/input.har/_masterindex
    -rw-r--r-- 3 hadoop supergroup 44 2021-12-19 15:57 /user/hadoop/input.har/part-0
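
    Since the _index file maps each logical file name to a location inside part-0 and is stored as plain text, you can peek at it directly (its exact format is an internal detail and may vary across versions):

    hdfs dfs -cat /user/hadoop/input.har/_index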
  4. View the archive's logical contents through the har:// URI

    [hadoop@hadoop001 data]$ hdfs dfs -ls har:///user/hadoop/input.har
    Found 4 items
    -rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/a.log
    -rw-r--r-- 3 hadoop supergroup 18 2021-12-19 15:54 har:///user/hadoop/input.har/b.log
    -rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/c.log
    drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 har:///user/hadoop/input.har/d
    [hadoop@hadoop001 data]$ hdfs dfs -ls -R har:///user/hadoop/input.har
    2021-12-19 16:03:48,906 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    -rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/a.log
    -rw-r--r-- 3 hadoop supergroup 18 2021-12-19 15:54 har:///user/hadoop/input.har/b.log
    -rw-r--r-- 3 hadoop supergroup 11 2021-12-19 15:54 har:///user/hadoop/input.har/c.log
    drwxr-xr-x - hadoop supergroup 0 2021-12-19 15:54 har:///user/hadoop/input.har/d
    -rw-r--r-- 3 hadoop supergroup 4 2021-12-19 15:54 har:///user/hadoop/input.har/d/e.log