Posted to user@hadoop.apache.org by Chris Nauroth <cn...@gmail.com> on 2016/12/30 20:57:21 UTC

Re: merging small files in HDFS

Hello Piyush,

I would typically accomplish this sort of thing by using
CombineFileInputFormat, which is capable of combining multiple small files
into a single input split.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html

This prevents launching a huge number of map tasks with each one performing
just a little bit of work to process each small file.  The job could use
the standard pass-through IdentityMapper, so that output records are
identical to the input records.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

The same data will be placed into a smaller number of files at the
destination.  The number of files can be controlled by setting the job's
number of reducers.  This is something you can tune toward your targeted
trade-off of number of files vs. size of each file.
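
For example, a rough, untested sketch of such a job (the paths, the split
size cap, and the reducer count are placeholders; I use a trivial mapper
rather than the old-API IdentityMapper so that the byte-offset keys are
dropped and each output line matches an input line):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

  // Pass-through mapper that drops the byte-offset key so each output
  // record is exactly one input line.
  public static class LineMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "merge small files");
    job.setJarByClass(MergeSmallFiles.class);

    // Pack many small files into each input split, capped at ~256 MB.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(LineMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    // The default Reducer is pass-through; the number of reducers
    // controls the number of output files.
    job.setNumReduceTasks(10);

    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the shuffle will sort the lines, so record order is not
preserved across the merge.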

Then, you can adjust this pattern if you have additional data preparation
requirements such as compressing the output.

I hope this helps.

--Chris

On Thu, Nov 3, 2016 at 10:34 PM, Piyush Mukati <pi...@gmail.com>
wrote:

> Hi,
> Thanks for the suggestion.
> "hadoop fs -getmerge" is a good and simple solution for a one-time activity
> on a few directories.
> But it may have problems at scale, as this solution copies the data from
> HDFS to the local filesystem and then puts it back into HDFS.
> Also, here we have to take care of compressing and decompressing
> separately.
> We need to run this merge every hour for thousands of directories.
>
>
>
> On Thu, Nov 3, 2016 at 7:28 PM, kumar, Senthil(AWF) <se...@ebay.com>
> wrote:
>
>> Can't we use getmerge here?  If your requirement is to merge some files
>> in a particular directory into a single file:
>>
>> hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
>>
>> --Senthil
>> -----Original Message-----
>> From: Giovanni Mascari [mailto:giovanni.mascari@polito.it]
>> Sent: Thursday, November 03, 2016 7:24 PM
>> To: Piyush Mukati <pi...@gmail.com>; user@hadoop.apache.org
>> Subject: Re: merging small files in HDFS
>>
>> Hi,
>> if I understand your request correctly, you only need to merge some data
>> resulting from an HDFS write operation.
>> In this case, I suppose that your best option is to use Hadoop Streaming
>> with the 'cat' command.
>>
>> take a look here:
>> https://hadoop.apache.org/docs/r1.2.1/streaming.html
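>>
>> For example, a rough sketch of such a streaming job (the jar path and
>> directories are placeholders for your cluster; with 'cat' as both mapper
>> and reducer and a single reducer, everything lands in one output file,
>> though the shuffle will reorder the lines):
>>
>> hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
>>     -input /path/to/small-files-dir \
>>     -output /path/to/merged-output \
>>     -mapper cat \
>>     -reducer cat \
>>     -numReduceTasks 1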
>>
>> Regards
>>
>> On 03/11/2016 13:53, Piyush Mukati wrote:
>> > Hi,
>> > I want to merge multiple files in one HDFS dir into one file. I am
>> > planning to write a map-only job using an input format which will create
>> > only one InputSplit per dir.
>> > This way my job doesn't need to do any shuffle/sort (only read and write
>> > back to disk). Is there any such file format already implemented?
>> > Or is there a better solution for the problem?
>> >
>> > Thanks.
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>> For additional commands, e-mail: user-help@hadoop.apache.org
>>
>>
>


-- 
Chris Nauroth

Re: merging small files in HDFS

Posted by Gabriel Balan <ga...@oracle.com>.
Hi

Here are a couple more alternatives.

_If the goal is writing the least amount of code_, I'd look into using Hive. Create an external table over the dir with lots of small data files, and another external table over the dir where I want the compacted data files. Select * from one table and insert it into the other (a rough HiveQL sketch follows below).

    Hive will use CombineFileInputFormat, and you don't have to subclass it to supply the record reader.
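
    A rough, untested HiveQL sketch (the table names, column, and paths are
    made up; it assumes plain text data where the default field delimiter,
    Ctrl-A, does not appear, so each line lands in the single STRING column):

    CREATE EXTERNAL TABLE small_files (line STRING)
      LOCATION '/data/incoming/small-files';

    CREATE EXTERNAL TABLE compacted (line STRING)
      LOCATION '/data/compacted';

    -- Hive reads the many small files through its combine input format and
    -- writes far fewer, larger files under /data/compacted.
    INSERT OVERWRITE TABLE compacted
    SELECT * FROM small_files;

    The number of output files follows from the combined split size and the
    hive.merge.* settings, which you can tune.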


For _best performance_, I'd go for a map-only job, with an input format like NLineInputFormat, and a custom Mapper. The general idea is to have each mapper receive a number of data file *names*, and "cat" those data files explicitly. (If they're text files, you can stream the bytes raw; otherwise use an inner input format/record reader.) A rough sketch of the text-file mapper follows the list of details below.

Here are some details:

  * List all the data files' names into a text file.
      o this is the input to the map-only job
      o hadoop fs -ls .... > file-list.txt

  * InputFormat:
      o you want to get as many splits as the desired number of output files
          + the number is a tradeoff between how few files you want and how fast you want this step to be.
          + if you want 1 file, then skip to "Mapper" below.
      o if the data file sizes don't vary wildly,
          + have each split consist of k lines (where k = #input files / # output files)
      o if the data file sizes are very different, you need to override getSplits() to implement some simple bin-packing approximation algorithm to group the files such that the total size in each group is roughly the same. For instance, see https://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm (the generalized version); a toy sketch of this greedy grouping also appears after the list.
  * Mapper
      o the input values: Text, each a name of a data file
      o if data files are text files:
          + create output file,
          + for each input value, open the data file with that name, stream it into the output file;
          + (you may need to add \n after each data file not ending in \n)
          + close the output file in Mapper::cleanup()
      o for arbitrary data formats:
          + you need to explicitly handle an inner input format/record reader to read from each data file
          + for each input value (a data file name),
              # make a new conf and set the mapred input dir to the data file's name.
              # have the inner input format give you a split
              # have the inner input format give you a record reader for that split
              # iterate over the record reader's k-v pairs, outputting them into the mapper's output.
              # (you need to set the output format appropriately)
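
A toy sketch of the greedy grouping mentioned under "InputFormat" above
(plain Java, no Hadoop types; getSplits() would then turn each group into
one split):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class GreedyGrouper {

  /** One group of files that will become one split / one output file. */
  static class Group {
    final List<String> files = new ArrayList<>();
    long totalSize = 0;
  }

  /**
   * Greedy multiway partition: sort files by size descending, then always
   * add the next file to the group with the smallest total so far.
   */
  static List<Group> group(Map<String, Long> fileSizes, int numGroups) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(fileSizes.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));  // largest first

    PriorityQueue<Group> byTotal =
        new PriorityQueue<>(Comparator.comparingLong((Group g) -> g.totalSize));
    List<Group> groups = new ArrayList<>();
    for (int i = 0; i < numGroups; i++) {
      Group g = new Group();
      groups.add(g);
      byTotal.add(g);
    }
    for (Map.Entry<String, Long> e : entries) {
      Group smallest = byTotal.poll();   // group with the least data so far
      smallest.files.add(e.getKey());
      smallest.totalSize += e.getValue();
      byTotal.add(smallest);             // re-insert with its updated total
    }
    return groups;
  }
}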

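And a rough, untested sketch of the text-file "cat" mapper. The
"merge.output.dir" property name is made up for illustration; the driver
would use NLineInputFormat (NLineInputFormat.setNumLinesPerSplit(job, k)),
zero reducers, and NullOutputFormat, since the mapper writes its merged
file directly as a side effect:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CatMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private FileSystem fs;
  private FSDataOutputStream out;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    fs = FileSystem.get(conf);
    // One merged file per map task; the task id keeps the names unique.
    Path outDir = new Path(conf.get("merge.output.dir", "/tmp/merged"));
    Path outFile = new Path(outDir, "merged-" + context.getTaskAttemptID().getTaskID());
    out = fs.create(outFile, true);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input value is one data file name (a line of file-list.txt).
    Path dataFile = new Path(value.toString().trim());
    try (FSDataInputStream in = fs.open(dataFile)) {
      IOUtils.copyBytes(in, out, 4096, false);  // stream raw bytes, keep 'out' open
    }
    // Unconditionally add a newline for simplicity; strictly it is only
    // needed when the data file does not already end with one.
    out.write('\n');
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();
  }
}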

my 2c

Gabriel Balan



-- 
The statements and opinions expressed here are my own and do not necessarily represent those of Oracle Corporation.