Posted to user@hadoop.apache.org by Anna Lahoud <an...@gmail.com> on 2012/10/01 20:30:50 UTC

File block size use

I would like to be able to resize a set of inputs, already in SequenceFile
format, to be larger.

I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
get what I expected. The outputs were exactly the same as the inputs.

I also tried running a job with an IdentityMapper and IdentityReducer.
Although that approaches a better solution, it still requires that I know
in advance how many reducers I need to get better file sizes.

I was looking at the SequenceFile.Writer constructors and noticed that
there are block size parameters that can be used. Using a writer
constructed with a 512MB block size, there is nothing that splits the
output and I simply get a single file the size of my inputs.
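
That behavior is consistent with what the block size parameter does: it only sets
the HDFS block size used to store the single output file, and a SequenceFile.Writer
never rolls over to a new file on its own. A rough sketch of that kind of writer
construction, assuming the 0.20.x API (the exact createWriter overload differs
between releases, and the output path and Text key/value classes are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriterSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Overload that takes buffer size, replication, and block size directly.
        // The block size only affects how HDFS stores this one file.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/tmp/merged.seq"),      // placeholder output path
            Text.class, Text.class,                     // assumed key/value classes
            conf.getInt("io.file.buffer.size", 4096),
            fs.getDefaultReplication(),
            512L * 1024 * 1024,                         // 512 MB HDFS block size
            SequenceFile.CompressionType.NONE, null, null,
            new SequenceFile.Metadata());
        writer.append(new Text("key"), new Text("value"));
        writer.close();                                 // still one output file
      }
    }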

What is the current standard for combining sequence files to create larger
files for map-reduce jobs? I have seen code that tracks what it writes into
the file, but that seems like the long version. I am hoping there is a
shorter path.

Thank you.

Anna

Re: File block size use

Posted by Anna Lahoud <an...@gmail.com>.
Bejoy - I tried this technique a number of times, and was not able to get
this to work. My files remain as they were on input. Is there a version I
need (beyond 0.20.2) to make this work, or another setting that could
prevent it from working?

On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:

> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512 MB). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job.
> Your job should be done.
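
A minimal sketch of that map-only job, assuming the old org.apache.hadoop.mapred
API shipped with 0.20.x; the Text key/value classes and the input/output paths
are placeholders and must match the actual sequence files:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class ConsolidateSeqFiles {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ConsolidateSeqFiles.class);
        conf.setJobName("consolidate-seqfiles");
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);                  // map-only, as suggested
        conf.setOutputKeyClass(Text.class);         // must match the input files
        conf.setOutputValueClass(Text.class);
        // Split size hints per the suggestion above (512 MB).
        conf.setLong("mapred.min.split.size", 512L * 1024 * 1024);
        conf.setLong("mapred.max.split.size", 512L * 1024 * 1024);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

One caveat: FileInputFormat splits never span input files, so with many small
input files each map task still reads one small file and the outputs stay small,
which may explain why this approach can leave the files effectively unchanged.
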
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Chris Nauroth <cn...@hortonworks.com>
> *Date: *Mon, 1 Oct 2012 21:12:58 -0700
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
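
A sketch of that kind of pass-through reducer, again using the old mapred API and
assuming TextInputFormat-style LongWritable/Text map output (the class name is a
placeholder):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Passes every value through, discarding the keys (file offsets from
    // TextInputFormat) by emitting NullWritable instead.
    public class NullKeyReducer extends MapReduceBase
        implements Reducer<LongWritable, Text, NullWritable, Text> {
      public void reduce(LongWritable key, Iterator<Text> values,
          OutputCollector<NullWritable, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          output.collect(NullWritable.get(), values.next());
        }
      }
    }

The driver would also call setMapOutputKeyClass/setMapOutputValueClass for the
identity map output and use NullWritable/Text as the final output key/value classes.
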
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
>  Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
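
A sketch of such a Partitioner under the same assumptions (NullWritable keys and
Text values coming out of the map), registered with conf.setPartitionerClass():

    import java.util.Random;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Spreads records across reduce tasks at random, since identical
    // NullWritable keys would otherwise all hash to the same partition.
    public class RandomPartitioner implements Partitioner<NullWritable, Text> {
      private final Random random = new Random();
      public void configure(JobConf job) { }
      public int getPartition(NullWritable key, Text value, int numPartitions) {
        return random.nextInt(numPartitions);
      }
    }
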
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
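
One possible shape for that sizing step, with the per-file target size as an
assumed parameter:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountSketch {
      // Pick a reducer count so each reducer writes roughly targetBytes of output.
      public static int reducersFor(Path inputDir, Configuration conf,
          long targetBytes) throws Exception {
        FileSystem fs = inputDir.getFileSystem(conf);
        long total = 0;
        for (FileStatus status : fs.listStatus(inputDir)) {
          total += status.getLen();
        }
        return (int) Math.max(1, (total + targetBytes - 1) / targetBytes);
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerCountSketch.class);
        long target = 512L * 1024 * 1024;   // ~512 MB per output file (assumed)
        conf.setNumReduceTasks(reducersFor(new Path(args[0]), conf, target));
        // ... remaining job setup (formats, mapper, reducer, paths) ...
      }
    }
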
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>
>> I would like to be able to resize a set of inputs, already in
>> SequenceFile format, to be larger.
>>
>> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
>> get what I expected. The outputs were exactly the same as the inputs.
>>
>> I also tried running a job with an IdentityMapper and IdentityReducer.
>> Although that approaches a better solution, it still requires that I know
>> in advance how many reducers I need to get better file sizes.
>>
>> I was looking at the SequenceFile.Writer constructors and noticed that
>> there are block size parameters that can be used. Using a writer
>> constructed with a 512MB block size, there is nothing that splits the
>> output and I simply get a single file the size of my inputs.
>>
>> What is the current standard for combining sequence files to create
>> larger files for map-reduce jobs? I have seen code that tracks what it
>> writes into the file, but that seems like the long version. I am hoping
>> there is a shorter path.
>>
>> Thank you.
>>
>> Anna
>>
>>
>

Re: File block size use

Posted by Anna Lahoud <an...@gmail.com>.
You are correct that I want to create a small number of large files from a
large number of small files. The only solution that has worked, as you say,
has been a custom M/R job. Thank you for the help and ideas.

On Tue, Oct 9, 2012 at 12:09 PM, Raj Vishwanathan <ra...@yahoo.com> wrote:

> Anna
>
> I misunderstood your problem. I thought you wanted to change the block
> size of every file. I didn't realize that you were aggregating multiple
> small files into a different, albeit smaller, set of larger files with a
> bigger block size to improve performance.
>
> I think as Chris suggested you need to have a custom M/R job or you could
> probably get away with some scripting magic :-)
>
> Raj
>
>   ------------------------------
> *From:* Anna Lahoud <an...@gmail.com>
> *To:* user@hadoop.apache.org; Raj Vishwanathan <ra...@yahoo.com>
> *Sent:* Tuesday, October 9, 2012 7:01 AM
>
> *Subject:* Re: File block size use
>
> Raj - I was not able to get this to work either.
>
> On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <ra...@yahoo.com>wrote:
>
> I haven't tried it but this should also work
>
>  hadoop  fs  -Ddfs.block.size=<NEW BLOCK SIZE> -cp  src dest
>
> Raj
>
>   ------------------------------
> *From:* Anna Lahoud <an...@gmail.com>
> *To:* user@hadoop.apache.org; bejoy.hadoop@gmail.com
> *Sent:* Tuesday, October 2, 2012 7:17 AM
>
> *Subject:* Re: File block size use
>
> Thank you. I will try today.
>
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:
>
> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512 MB). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job.
> Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Chris Nauroth <cn...@hortonworks.com>
> *Date: *Mon, 1 Oct 2012 21:12:58 -0700
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
>  Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
>
>
>
>
>
>
>
>
>

Re: File block size use

Posted by Raj Vishwanathan <ra...@yahoo.com>.
Anna

I misunderstood your problem. I thought you wanted to change the block size of every file. I didn't realize that you were aggregating multiple small files into a different, albeit smaller, set of larger files with a bigger block size to improve performance.

I think as Chris suggested you need to have a custom M/R job or you could probably get away with some scripting magic :-)

Raj



>________________________________
> From: Anna Lahoud <an...@gmail.com>
>To: user@hadoop.apache.org; Raj Vishwanathan <ra...@yahoo.com> 
>Sent: Tuesday, October 9, 2012 7:01 AM
>Subject: Re: File block size use
> 
>
>Raj - I was not able to get this to work either. 
>
>
>On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <ra...@yahoo.com> wrote:
>
>I haven't tried it but this should also work
>>
>>
>> hadoop  fs  -Ddfs.block.size=<NEW BLOCK SIZE> -cp  src dest
>>
>>
>>
>>Raj
>>
>>
>>
>>>________________________________
>>> From: Anna Lahoud <an...@gmail.com>
>>>To: user@hadoop.apache.org; bejoy.hadoop@gmail.com 
>>>Sent: Tuesday, October 2, 2012 7:17 AM
>>>
>>>Subject: Re: File block size use
>>> 
>>>
>>>
>>>Thank you. I will try today.
>>>
>>>
>>>On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:
>>>
>>>Hi Anna
>>>>
>>>>If you want to increase the block size of existing files, you can use an Identity Mapper with no reducer. Set the min and max split sizes to your requirement (512 MB). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job.
>>>>Your job should be done.
>>>>
>>>>
>>>>Regards
>>>>Bejoy KS
>>>>
>>>>Sent from handheld, please excuse typos.
>>>>________________________________
>>>>
>>>>From:  Chris Nauroth <cn...@hortonworks.com> 
>>>>Date: Mon, 1 Oct 2012 21:12:58 -0700
>>>>To: <us...@hadoop.apache.org>
>>>>ReplyTo:  user@hadoop.apache.org 
>>>>Subject: Re: File block size use
>>>>
>>>>Hello Anna,
>>>>
>>>>
>>>>If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size.  Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records.
>>>>
>>>>
>>>>Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation.  In my case, I was typically working with TextInputFormat (not sequence files).  I used IdentityMapper and a custom reducer that passed through all values but with key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data.  For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results.  (This may or may not be true for your data set though.)
>>>>
>>>>
>>>>A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output.  Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size.  This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
>>>>
>>>>
>>>>To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size).  I'm not aware of any built-in or external utilities that do this for you though.
>>>>
>>>>
>>>>Hope this helps,
>>>>--Chris
>>>>
>>>>
>>>>On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>>>>
>>>>I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger. 
>>>>>
>>>>>I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs. 
>>>>>
>>>>>I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes. 
>>>>>
>>>>>I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs. 
>>>>>
>>>>>What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.
>>>>>
>>>>>Thank you.
>>>>>
>>>>>Anna
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>
>
>

Re: File block size use

Posted by Anna Lahoud <an...@gmail.com>.
Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <ra...@yahoo.com> wrote:

> I haven't tried it but this should also work
>
>  hadoop  fs  -Ddfs.block.size=<NEW BLOCK SIZE> -cp  src dest
>
> Raj
>
>   ------------------------------
> *From:* Anna Lahoud <an...@gmail.com>
> *To:* user@hadoop.apache.org; bejoy.hadoop@gmail.com
> *Sent:* Tuesday, October 2, 2012 7:17 AM
>
> *Subject:* Re: File block size use
>
> Thank you. I will try today.
>
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:
>
> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512 MB). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job.
> Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Chris Nauroth <cn...@hortonworks.com>
> *Date: *Mon, 1 Oct 2012 21:12:58 -0700
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
>  Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
>
>
>
>
>
>

Re: File block size use

Posted by Anna Lahoud <an...@gmail.com>.
Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <ra...@yahoo.com> wrote:

> I haven't tried it but this should also work
>
>  hadoop  fs  -Ddfs.block.size=<NEW BLOCK SIZE> -cp  src dest
>
> Raj
>
>   ------------------------------
> *From:* Anna Lahoud <an...@gmail.com>
> *To:* user@hadoop.apache.org; bejoy.hadoop@gmail.com
> *Sent:* Tuesday, October 2, 2012 7:17 AM
>
> *Subject:* Re: File block size use
>
> Thank you. I will try today.
>
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:
>
> **
> Hi Anna
>
> If you want to increase the block size of existing files. You can use a
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512Mb). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job.
> Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Chris Nauroth <cn...@hortonworks.com>
> *Date: *Mon, 1 Oct 2012 21:12:58 -0700
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
>  Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
>
>
>
>
>
>

Re: File block size use

Posted by Raj Vishwanathan <ra...@yahoo.com>.
I haven't tried it but this should also work

 hadoop  fs  -Ddfs.block.size=<NEW BLOCK SIZE> -cp  src dest


Raj
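
One caveat that applies to the block-size-only approaches in this thread
(distcp or fs -cp with -Ddfs.block.size, and the block size argument to
SequenceFile.Writer): at best they change the HDFS block size of the files
that get written; they never merge several small files into fewer, larger
ones, which is consistent with the "outputs were exactly the same as the
inputs" behaviour reported above. A quick way to see what such a copy
actually did is to check the length and block size of each destination file,
for example with 'hadoop fsck <path> -files -blocks' or with a small
hypothetical helper like this (not from the thread):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class PrintBlockSizes {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path dir = new Path(args[0]);
      FileSystem fs = dir.getFileSystem(conf);
      // Print length and HDFS block size for every file under the given path.
      for (FileStatus status : fs.listStatus(dir)) {
        System.out.println(status.getPath() + " len=" + status.getLen()
            + " blockSize=" + status.getBlockSize());
      }
    }
  }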



>________________________________
> From: Anna Lahoud <an...@gmail.com>
>To: user@hadoop.apache.org; bejoy.hadoop@gmail.com 
>Sent: Tuesday, October 2, 2012 7:17 AM
>Subject: Re: File block size use
> 
>
>Thank you. I will try today.
>
>
>On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:
>
>Hi Anna
>>
>>If you want to increase the block size of existing files, you can use an Identity Mapper with no reducer. Set the min and max split sizes to your requirement (512MB). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job.
>>Your job should be done.
>>
>>
>>Regards
>>Bejoy KS
>>
>>Sent from handheld, please excuse typos.
>>________________________________
>>
>>From:  Chris Nauroth <cn...@hortonworks.com> 
>>Date: Mon, 1 Oct 2012 21:12:58 -0700
>>To: <us...@hadoop.apache.org>
>>ReplyTo:  user@hadoop.apache.org 
>>Subject: Re: File block size use
>>
>>Hello Anna,
>>
>>
>>If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size.  Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records.
>>
>>
>>Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation.  In my case, I was typically working with TextInputFormat (not sequence files).  I used IdentityMapper and a custom reducer that passed through all values but with key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data.  For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results.  (This may or may not be true for your data set though.)
>>
>>
>>A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output.  Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size.  This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
>>
>>
>>To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size).  I'm not aware of any built-in or external utilities that do this for you though.
>>
>>
>>Hope this helps,
>>--Chris
>>
>>
>>On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>>
>>I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger. 
>>>
>>>I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs. 
>>>
>>>I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes. 
>>>
>>>I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs. 
>>>
>>>What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.
>>>
>>>Thank you.
>>>
>>>Anna
>>>
>>>
>>
>
>
>

Re: File block size use

Posted by Anna Lahoud <an...@gmail.com>.
Thank you. I will try today.

On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <be...@gmail.com> wrote:

> **
> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512MB). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job.
> Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Chris Nauroth <cn...@hortonworks.com>
> *Date: *Mon, 1 Oct 2012 21:12:58 -0700
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
>  Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>
>> I would like to be able to resize a set of inputs, already in
>> SequenceFile format, to be larger.
>>
>> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
>> get what I expected. The outputs were exactly the same as the inputs.
>>
>> I also tried running a job with an IdentityMapper and IdentityReducer.
>> Although that approaches a better solution, it still requires that I know
>> in advance how many reducers I need to get better file sizes.
>>
>> I was looking at the SequenceFile.Writer constructors and noticed that
>> there are block size parameters that can be used. Using a writer
>> constructed with a 512MB block size, there is nothing that splits the
>> output and I simply get a single file the size of my inputs.
>>
>> What is the current standard for combining sequence files to create
>> larger files for map-reduce jobs? I have seen code that tracks what it
>> writes into the file, but that seems like the long version. I am hoping
>> there is a shorter path.
>>
>> Thank you.
>>
>> Anna
>>
>>
>

Re: File block size use

Posted by Bejoy KS <be...@gmail.com>.
Hi Anna

If you want to increase the block size of existing files, you can use an Identity Mapper with no reducer. Set the min and max split sizes to your requirement (512MB). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job.
Your job should be done.

 
Regards
Bejoy KS

Sent from handheld, please excuse typos.
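
For reference, a minimal sketch of the map-only job described at the top of
this message, again using the old "mapred" API and assuming Text keys and
values in the input sequence files (the class name, paths, and the exact
split-size property names are illustrative and vary between Hadoop versions):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class IdentityResizeJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(IdentityResizeJob.class);
      conf.setJobName("identity-resize");
      long target = 512L * 1024 * 1024;
      // 0.20-era property names; newer releases use
      // mapreduce.input.fileinputformat.split.minsize / .maxsize instead.
      conf.setLong("mapred.min.split.size", target);
      conf.setLong("mapred.max.split.size", target);
      conf.setInputFormat(SequenceFileInputFormat.class);
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      conf.setMapperClass(IdentityMapper.class);
      conf.setNumReduceTasks(0);               // map-only, as suggested
      conf.setOutputKeyClass(Text.class);      // substitute the real key class
      conf.setOutputValueClass(Text.class);    // substitute the real value class
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

One thing to be aware of: the stock FileInputFormat computes splits per input
file, so a file that is already smaller than the split size still becomes one
map task and, in a map-only job, one output file. That matches the behaviour
Anna reports, and it is why the reducer-based consolidation job sketched
earlier in the thread (or something like CombineFileInputFormat) is usually
what actually merges small files.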

-----Original Message-----
From: Chris Nauroth <cn...@hortonworks.com>
Date: Mon, 1 Oct 2012 21:12:58 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: File block size use

Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each
much smaller than the desired block size, and you want to concatenate them
into a set of fewer files, each one more closely aligned to your desired
block size.  Presumably, the goal is to improve throughput of map reduce
jobs using those files as input by running fewer map tasks, reading a
larger number of input records.

Whenever I've had this kind of requirement, I've run a custom map reduce
job to implement the file consolidation.  In my case, I was typically
working with TextInputFormat (not sequence files).  I used IdentityMapper
and a custom reducer that passed through all values but with key set to
NullWritable, because the keys (input file offsets in the case of
TextInputFormat) were not valuable data.  For my input data, this was
sufficient to achieve fairly even distribution of data across the reducer
tasks, and I could reasonably predict the input data set size, so I could
reasonably set the number of reducers and get decent results.  (This may or
may not be true for your data set though.)

A weakness of this approach is that the keys must pass from the map tasks
to the reduce tasks, only to get discarded before writing the final output.
 Also, the distribution of input records to reduce tasks is not truly
random, and therefore the reduce output files may be uneven in size.  This
could be solved by writing NullWritable keys out of the map task instead of
the reduce task and writing a custom implementation of Partitioner to
distribute them randomly.

To expand on this idea, it could be possible to inspect the FileStatus of
each input, sum the values of FileStatus.getLen(), and then use that
information to make a decision about how many reducers to run (and
therefore approximately set a target output file size).  I'm not aware of
any built-in or external utilities that do this for you though.

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:

> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
>
>


Re: File block size use

Posted by Anna Lahoud <an...@gmail.com>.
Chris - You are absolutely correct in what I am trying to accomplish -
decrease the number of files going to the maps. Admittedly, I haven't run
through all the suggestions yet today. I hope to do that by day's end.
Thank you and I will give an update later on what worked.

On Tue, Oct 2, 2012 at 12:12 AM, Chris Nauroth <cn...@hortonworks.com>wrote:

> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
>  Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:
>
>> I would like to be able to resize a set of inputs, already in
>> SequenceFile format, to be larger.
>>
>> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
>> get what I expected. The outputs were exactly the same as the inputs.
>>
>> I also tried running a job with an IdentityMapper and IdentityReducer.
>> Although that approaches a better solution, it still requires that I know
>> in advance how many reducers I need to get better file sizes.
>>
>> I was looking at the SequenceFile.Writer constructors and noticed that
>> there are block size parameters that can be used. Using a writer
>> constructed with a 512MB block size, there is nothing that splits the
>> output and I simply get a single file the size of my inputs.
>>
>> What is the current standard for combining sequence files to create
>> larger files for map-reduce jobs? I have seen code that tracks what it
>> writes into the file, but that seems like the long version. I am hoping
>> there is a shorter path.
>>
>> Thank you.
>>
>> Anna
>>
>>
>

Re: File block size use

Posted by Bejoy KS <be...@gmail.com>.
Hi Anna

If you want to increase the block size of existing files, you can use an IdentityMapper with no reducer. Set the min and max split sizes to your requirement (512MB). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job, as sketched below.
Your job should be done.
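
Roughly, a driver for that could look like the untested sketch below, written
against the newer mapreduce API (there the base Mapper class already behaves
as an identity mapper). The ResizeSequenceFiles name and the Text key/value
classes are only placeholders; substitute whatever types your sequence files
actually hold. Keep in mind that a split never spans files, so the split
settings only matter for input files that are individually larger than the
target.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ResizeSequenceFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "resize sequence files");
    job.setJarByClass(ResizeSequenceFiles.class);

    // Map-only job: the base Mapper is an identity mapper, and with zero
    // reduce tasks the map output is written directly as the job output.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // Placeholder key/value classes; use the actual types in your files.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Ask for ~512MB splits. A split never spans files, so this only
    // changes things for input files that are larger than the target.
    long targetSplit = 512L * 1024 * 1024;
    FileInputFormat.setMinInputSplitSize(job, targetSplit);
    FileInputFormat.setMaxInputSplitSize(job, targetSplit);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}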

 
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Chris Nauroth <cn...@hortonworks.com>
Date: Mon, 1 Oct 2012 21:12:58 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: File block size use

Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each
much smaller than the desired block size, and you want to concatenate them
into a set of fewer files, each one more closely aligned to your desired
block size.  Presumably, the goal is to improve throughput of map reduce
jobs using those files as input by running fewer map tasks, reading a
larger number of input records.

Whenever I've had this kind of requirement, I've run a custom map reduce
job to implement the file consolidation.  In my case, I was typically
working with TextInputFormat (not sequence files).  I used IdentityMapper
and a custom reducer that passed through all values but with key set to
NullWritable, because the keys (input file offsets in the case of
TextInputFormat) were not valuable data.  For my input data, this was
sufficient to achieve fairly even distribution of data across the reducer
tasks, and I could reasonably predict the input data set size, so I could
reasonably set the number of reducers and get decent results.  (This may or
may not be true for your data set though.)

A weakness of this approach is that the keys must pass from the map tasks
to the reduce tasks, only to get discarded before writing the final output.
 Also, the distribution of input records to reduce tasks is not truly
random, and therefore the reduce output files may be uneven in size.  This
could be solved by writing NullWritable keys out of the map task instead of
the reduce task and writing a custom implementation of Partitioner to
distribute them randomly.

To expand on this idea, it could be possible to inspect the FileStatus of
each input, sum the values of FileStatus.getLen(), and then use that
information to make a decision about how many reducers to run (and
therefore approximately set a target output file size).  I'm not aware of
any built-in or external utilities that do this for you though.

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:

> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
>
>


Re: File block size use

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each
much smaller than the desired block size, and you want to concatenate them
into a set of fewer files, each one more closely aligned to your desired
block size.  Presumably, the goal is to improve throughput of map reduce
jobs using those files as input by running fewer map tasks, reading a
larger number of input records.

Whenever I've had this kind of requirement, I've run a custom map reduce
job to implement the file consolidation.  In my case, I was typically
working with TextInputFormat (not sequence files).  I used IdentityMapper
and a custom reducer that passed through all values but with key set to
NullWritable, because the keys (input file offsets in the case of
TextInputFormat) were not valuable data.  For my input data, this was
sufficient to achieve fairly even distribution of data across the reducer
tasks, and I could reasonably predict the input data set size, so I could
reasonably set the number of reducers and get decent results.  (This may or
may not be true for your data set though.)

A weakness of this approach is that the keys must pass from the map tasks
to the reduce tasks, only to get discarded before writing the final output.
 Also, the distribution of input records to reduce tasks is not truly
random, and therefore the reduce output files may be uneven in size.  This
could be solved by writing NullWritable keys out of the map task instead of
the reduce task and writing a custom implementation of Partitioner to
distribute them randomly.
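
For illustration only, a bare-bones version of that variation might look
something like the classes below, written against the newer mapreduce API.
The class names are made up and nothing here is tested; you would still wire
them into a job with setMapperClass, setPartitionerClass, setReducerClass,
and a reducer count chosen along the lines of the next paragraph.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class ConsolidateFiles {

  // Drop the file-offset key on the map side and emit NullWritable, so
  // only the record values travel through the shuffle.
  public static class DropKeyMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), line);
    }
  }

  // With a constant key, the default hash partitioner would send every
  // record to the same reducer, so spread the records randomly instead.
  public static class RandomPartitioner
      extends Partitioner<NullWritable, Text> {
    private final Random random = new Random();

    @Override
    public int getPartition(NullWritable key, Text value, int numPartitions) {
      return random.nextInt(numPartitions);
    }
  }

  // Pass every value through unchanged; each reducer writes one larger file.
  public static class PassThroughReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values,
        Context context) throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(key, value);
      }
    }
  }
}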

To expand on this idea, it could be possible to inspect the FileStatus of
each input, sum the values of FileStatus.getLen(), and then use that
information to make a decision about how many reducers to run (and
therefore approximately set a target output file size).  I'm not aware of
any built-in or external utilities that do this for you though.
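
As a rough, untested sketch of that sizing step (it assumes a flat input
directory and a hypothetical estimateReducers helper; recurse into
subdirectories if your layout needs it):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReducerCountEstimator {

  // Sum the sizes of the files directly under inputDir and return how many
  // reducers are needed to land near targetBytes per output file.
  public static int estimateReducers(Configuration conf, Path inputDir,
      long targetBytes) throws IOException {
    FileSystem fs = inputDir.getFileSystem(conf);
    long totalBytes = 0L;
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (!status.isDir()) {
        totalBytes += status.getLen();
      }
    }
    // Round up so the last output file does not end up oversized.
    return (int) Math.max(1L, (totalBytes + targetBytes - 1) / targetBytes);
  }
}

A driver could then call something like
job.setNumReduceTasks(estimateReducers(conf, inputDir, 512L * 1024 * 1024))
before submitting.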

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <an...@gmail.com> wrote:

> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
>
>
