Posted to mapreduce-user@hadoop.apache.org by Jerry Lam <ch...@gmail.com> on 2013/08/20 00:09:08 UTC

produce a large sequencefile (1TB)

Hi Hadoop users and developers,

I have a use case where I need to produce a single large sequence file, 1 TB in
size, when each datanode has only 200GB of storage, although I have 30 datanodes.

The problem is that no single reducer can hold 1TB of data during the
reduce phase to generate a single sequence file, even if I use aggressive
compression. Whichever datanode runs the reducer will run out of space,
since this is a single-reducer job.

Any comment and help is appreciated.

Jerry

Re: about append

Posted by Harsh J <ha...@cloudera.com>.
The append implementation in 1.x was the first take, and was found to
carry a few important issues; it therefore received a newer
implementation in what is now 0.23.x/2.x. Read
https://issues.apache.org/jira/browse/HDFS-265.
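
For reference, on 0.23.x/2.x the append path is exposed through the plain
FileSystem API. A minimal sketch (the hdfs:// URI and file name below are
hypothetical, the file must already exist, and append can still be refused
by file systems that do not support it):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical path; point this at an existing file on a 2.x cluster.
    String uri = "hdfs://namenode:8020/user/demo/log.txt";

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // append() reopens the existing file and positions the stream at its end.
    FSDataOutputStream out = fs.append(new Path(uri));
    out.write("one more line\n".getBytes("UTF-8"));
    out.close();
    fs.close();
  }
}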

On Tue, Aug 20, 2013 at 3:58 PM, gsw204 <gs...@163.com> wrote:
> hi,
>    I want to know why append is not supported in hadoop-1.1.x?
>
> ________________________________
> gsw
>
>



-- 
Harsh J

Re: about append

Posted by "Francis.Hu" <fr...@reachjunction.com>.
It is because the implementation cannot handle concurrent access to the same file well.

From: gsw204 [mailto:gsw204@163.com]
Sent: Tuesday, August 20, 2013 18:28
To: user
Subject: about append

hi,

   I want to know why append is not supported in hadoop-1.1.x?

________________________________

gsw


about append

Posted by gsw204 <gs...@163.com>.
hi,
   I want to know why append is not supported in hadoop-1.1.x?




gsw

Re: produce a large sequencefile (1TB)

Posted by Jerry Lam <ch...@gmail.com>.
Hi Harsh,

Thank you for the reply. It really answers my question and provides
practical advice.

Best Regards,

Jerry


On Tue, Aug 20, 2013 at 1:38 AM, Harsh J <ha...@cloudera.com> wrote:

> Unfortunately given the way Reducers work today you wouldn't be able
> to do this. They are designed to fetch all data before the merge, sort
> and process it through the reducer implementation. For that to work,
> as you've yourself deduced, you will need as much space locally
> available.
>
> What you could do however, is perhaps just run a Map-only job, let it
> produce smaller files, then run a non-MR java app that reads them all
> one by one, and appends to a single HDFS SequenceFile. This is like a
> reducer, but minus a local sort phase. If the sort is important to you
> as well, then your tweaking will have to go further into using
> multiple reducers with Total Order Partitioning, and then running this
> external java app.
>
> On Tue, Aug 20, 2013 at 8:25 AM, Bing Jiang <ji...@gmail.com>
> wrote:
> > Hi Jerry,
> >
> > I think whether it is acceptable to set multiple reducers to generate
> more
> > MapFile(IndexFile, DataFile)s.
> >
> > I want to know the real difficulties of multiply reducer to
> post-processing.
> > Maybe there are some questions about app?
> >
> >
> >
> > 2013/8/20 Jerry Lam <ch...@gmail.com>
> >>
> >> Hi Bing,
> >>
> >> you are correct. The local storage does not have enough capacity to hold
> >> the temporary files generated by the mappers. Since we want a single
> >> sequence file at the end, we are forced to use 1 reducer.
> >>
> >> The use case is that we want to generate an index for the 1TB sequence
> >> file that we can randomly access each row in the sequence file. In
> practice,
> >> this is simply a MapFile.
> >>
> >> Any idea how to resolve this dilemma is greatly appreciated.
> >>
> >> Jerry
> >>
> >>
> >>
> >> On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang <ji...@gmail.com>
> >> wrote:
> >>>
> >>> hi,Jerry.
> >>> I think you are worrying about the volumn of mapreduce local file, but
> >>> would  you give us more details about your apps.
> >>>
> >>> On Aug 20, 2013 6:09 AM, "Jerry Lam" <ch...@gmail.com> wrote:
> >>>>
> >>>> Hi Hadoop users and developers,
> >>>>
> >>>> I have a use case that I need produce a large sequence file of 1 TB in
> >>>> size when each datanode has  200GB of storage but I have 30 datanodes.
> >>>>
> >>>> The problem is that no single reducer can hold 1TB of data during the
> >>>> reduce phase to generate a single sequence file even I use aggressive
> >>>> compression. Any datanode will run out of space since this is a single
> >>>> reducer job.
> >>>>
> >>>> Any comment and help is appreciated.
> >>>>
> >>>> Jerry
> >>
> >>
> >
> >
> >
> > --
> > Bing Jiang
> > Tel:(86)134-2619-1361
> > weibo: http://weibo.com/jiangbinglover
> > BLOG: www.binospace.com
> > BLOG: http://blog.sina.com.cn/jiangbinglover
> > Focus on distributed computing, HDFS/HBase
>
>
>
> --
> Harsh J
>

Re: produce a large sequencefile (1TB)

Posted by Harsh J <ha...@cloudera.com>.
Unfortunately, given the way Reducers work today, you wouldn't be able
to do this. They are designed to fetch all data before they merge, sort
and process it through the reducer implementation. For that to work,
as you've deduced yourself, you will need that much space available
locally.

What you could do, however, is just run a Map-only job, let it
produce smaller files, and then run a non-MR Java app that reads them all
one by one and appends them to a single HDFS SequenceFile. This is like a
reducer, minus the local sort phase. If the sort is important to you
as well, then your tweaking will have to go further, into using
multiple reducers with Total Order Partitioning and then running this
external Java app.
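
For what it's worth, a minimal sketch of such a non-MR concatenation app
(the input and output paths below are hypothetical, and it assumes every
part file carries the same key/value classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ConcatSequenceFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical locations: the map-only job's output dir and the final file.
    Path inputDir = new Path("/data/map-output");
    Path target = new Path("/data/merged.seq");

    FileStatus[] parts = fs.globStatus(new Path(inputDir, "part-*"));

    SequenceFile.Writer writer = null;
    Writable key = null, value = null;
    for (FileStatus part : parts) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      if (writer == null) {
        // Create the single output file using the key/value classes of the first part.
        writer = SequenceFile.createWriter(fs, conf, target,
            reader.getKeyClass(), reader.getValueClass());
        key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      }
      while (reader.next(key, value)) {
        writer.append(key, value);   // re-serialize each record into the big file
      }
      reader.close();
    }
    if (writer != null) {
      writer.close();
    }
  }
}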

On Tue, Aug 20, 2013 at 8:25 AM, Bing Jiang <ji...@gmail.com> wrote:
> Hi Jerry,
>
> I think whether it is acceptable to set multiple reducers to generate more
> MapFile(IndexFile, DataFile)s.
>
> I want to know the real difficulties of multiply reducer to post-processing.
> Maybe there are some questions about app?
>
>
>
> 2013/8/20 Jerry Lam <ch...@gmail.com>
>>
>> Hi Bing,
>>
>> you are correct. The local storage does not have enough capacity to hold
>> the temporary files generated by the mappers. Since we want a single
>> sequence file at the end, we are forced to use 1 reducer.
>>
>> The use case is that we want to generate an index for the 1TB sequence
>> file that we can randomly access each row in the sequence file. In practice,
>> this is simply a MapFile.
>>
>> Any idea how to resolve this dilemma is greatly appreciated.
>>
>> Jerry
>>
>>
>>
>> On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang <ji...@gmail.com>
>> wrote:
>>>
>>> hi,Jerry.
>>> I think you are worrying about the volumn of mapreduce local file, but
>>> would  you give us more details about your apps.
>>>
>>> On Aug 20, 2013 6:09 AM, "Jerry Lam" <ch...@gmail.com> wrote:
>>>>
>>>> Hi Hadoop users and developers,
>>>>
>>>> I have a use case that I need produce a large sequence file of 1 TB in
>>>> size when each datanode has  200GB of storage but I have 30 datanodes.
>>>>
>>>> The problem is that no single reducer can hold 1TB of data during the
>>>> reduce phase to generate a single sequence file even I use aggressive
>>>> compression. Any datanode will run out of space since this is a single
>>>> reducer job.
>>>>
>>>> Any comment and help is appreciated.
>>>>
>>>> Jerry
>>
>>
>
>
>
> --
> Bing Jiang
> Tel:(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase



-- 
Harsh J

Re: produce a large sequencefile (1TB)

Posted by Bing Jiang <ji...@gmail.com>.
Hi Jerry,

I wonder whether it would be acceptable to set multiple reducers and generate
several MapFiles (IndexFile, DataFile) instead.

I would like to know what the real difficulty is with post-processing the
output of multiple reducers. Maybe there are some application-specific constraints?
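
For illustration, if several reducers are used, a reader can still find the
MapFile that holds a given key by recomputing its partition with the same
partitioner the job used. A rough sketch, assuming the default HashPartitioner,
new-API part-r-NNNNN output naming, and placeholder paths and key/value types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionedMapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path outputDir = new Path("/data/mapfiles");   // hypothetical job output dir
    int numReducers = 30;                          // must match the writing job
    Text key = new Text("row-000042");             // placeholder key

    // Recompute the partition exactly as the job's HashPartitioner did.
    HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
    int partition = partitioner.getPartition(key, null, numReducers);

    // Each reducer output (part-r-NNNNN) is itself a MapFile directory.
    Path mapFileDir = new Path(outputDir, String.format("part-r-%05d", partition));
    MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir.toString(), conf);
    Text value = new Text();
    if (reader.get(key, value) != null) {
      System.out.println(key + " => " + value);
    }
    reader.close();
  }
}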



2013/8/20 Jerry Lam <ch...@gmail.com>

> Hi Bing,
>
> you are correct. The local storage does not have enough capacity to hold
> the temporary files generated by the mappers. Since we want a single
> sequence file at the end, we are forced to use 1 reducer.
>
> The use case is that we want to generate an index for the 1TB sequence
> file that we can randomly access each row in the sequence file. In
> practice, this is simply a MapFile.
>
> Any idea how to resolve this dilemma is greatly appreciated.
>
> Jerry
>
>
>
> On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang <ji...@gmail.com>wrote:
>
>> hi,Jerry.
>> I think you are worrying about the volumn of mapreduce local file, but
>> would  you give us more details about your apps.
>>  On Aug 20, 2013 6:09 AM, "Jerry Lam" <ch...@gmail.com> wrote:
>>
>>> Hi Hadoop users and developers,
>>>
>>> I have a use case that I need produce a large sequence file of 1 TB in
>>> size when each datanode has  200GB of storage but I have 30 datanodes.
>>>
>>> The problem is that no single reducer can hold 1TB of data during the
>>> reduce phase to generate a single sequence file even I use aggressive
>>> compression. Any datanode will run out of space since this is a single
>>> reducer job.
>>>
>>> Any comment and help is appreciated.
>>>
>>> Jerry
>>>
>>
>


-- 
Bing Jiang
Tel:(86)134-2619-1361
weibo: http://weibo.com/jiangbinglover
BLOG: www.binospace.com
BLOG: http://blog.sina.com.cn/jiangbinglover
Focus on distributed computing, HDFS/HBase

Re: produce a large sequencefile (1TB)

Posted by Jerry Lam <ch...@gmail.com>.
Hi Bing,

You are correct. The local storage does not have enough capacity to hold
the temporary files generated by the mappers. Since we want a single
sequence file at the end, we are forced to use 1 reducer.

The use case is that we want to generate an index for the 1TB sequence
file so that we can randomly access each row in the sequence file. In
practice, this is simply a MapFile.
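
For context, random access against a MapFile looks roughly like this (the
directory name and the Text key/value types below are placeholders for the
sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical MapFile directory (it contains the "index" and "data" files).
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/rows.map", conf);

    Text key = new Text("row-000042");   // placeholder key
    Text value = new Text();
    // get() seeks via the index file and returns null if the key is absent.
    if (reader.get(key, value) != null) {
      System.out.println(key + " => " + value);
    }
    reader.close();
  }
}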

Any idea how to resolve this dilemma is greatly appreciated.

Jerry



On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang <ji...@gmail.com>wrote:

> hi,Jerry.
> I think you are worrying about the volumn of mapreduce local file, but
> would  you give us more details about your apps.
>  On Aug 20, 2013 6:09 AM, "Jerry Lam" <ch...@gmail.com> wrote:
>
>> Hi Hadoop users and developers,
>>
>> I have a use case that I need produce a large sequence file of 1 TB in
>> size when each datanode has  200GB of storage but I have 30 datanodes.
>>
>> The problem is that no single reducer can hold 1TB of data during the
>> reduce phase to generate a single sequence file even I use aggressive
>> compression. Any datanode will run out of space since this is a single
>> reducer job.
>>
>> Any comment and help is appreciated.
>>
>> Jerry
>>
>

Re: produce a large sequencefile (1TB)

Posted by Bing Jiang <ji...@gmail.com>.
hi, Jerry.
I think you are worrying about the volume of the MapReduce local files, but
would you give us more details about your app?
 On Aug 20, 2013 6:09 AM, "Jerry Lam" <ch...@gmail.com> wrote:

> Hi Hadoop users and developers,
>
> I have a use case that I need produce a large sequence file of 1 TB in
> size when each datanode has  200GB of storage but I have 30 datanodes.
>
> The problem is that no single reducer can hold 1TB of data during the
> reduce phase to generate a single sequence file even I use aggressive
> compression. Any datanode will run out of space since this is a single
> reducer job.
>
> Any comment and help is appreciated.
>
> Jerry
>
