
Posted to common-user@hadoop.apache.org by Steve Loughran <st...@apache.org> on 2011/05/02 19:10:34 UTC

Re: Applications creates bigger output than input?

On 30/04/2011 05:31, elton sky wrote:
> Thank you for the suggestions:
>
> Weblog analysis, market basket analysis and generating a search index.
>
> I guess for these applications we need more reduces than maps, to handle
> the large intermediate output, don't we? Besides, the input split for each
> map should be smaller than usual, because the spill and merge workload on
> the map's local disk is heavy.

Any form of rendering can generate very large images.

See: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Re: Applications creates bigger output than input?

Posted by elton sky <el...@gmail.com>.

Thanks Robert, Niels

Yes, I think text manipulation, especially n-grams, is a good application
for me.
Cheers


Re: Applications creates bigger output than input?

Posted by Niels Basjes <Ni...@basjes.nl>.

Something I've seen in the past is code that has the input
   "something"
and outputs
   "s"
   "so"
   "som"
   "some"
   "somet"
   "someth"
   "somethi"
   "somethin"
   "something"

So the number of output records equals the length of the input text.
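
A rough sketch of the prefix expansion described above, in plain Python rather than as a real Hadoop job. The function name prefix_mapper is made up for illustration and is not taken from the code I saw. It emits one record per prefix, so the record count equals the input length, while the total number of output characters grows as n(n+1)/2 for an input of n characters.

```python
def prefix_mapper(record):
    """Yield every non-empty prefix of the input record, shortest first.

    Hypothetical helper for illustration only; a real job would emit
    these records through the framework's output collector instead.
    """
    for end in range(1, len(record) + 1):
        yield record[:end]


if __name__ == "__main__":
    text = "something"
    prefixes = list(prefix_mapper(text))
    # One record per input character, but the characters written grow
    # quadratically: 1 + 2 + ... + n = n * (n + 1) / 2.
    output_chars = sum(len(p) for p in prefixes)
    print(len(prefixes), len(text), output_chars)  # 9 9 45
```

For the 9-character input this writes 45 characters, five times the input, and the ratio keeps growing with longer records. The work per record is just slicing, so the map side stays cheap.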

Niels


-- 
Kind regards,

Niels Basjes

Re: Applications creates bigger output than input?

Posted by Robert Evans <ev...@yahoo-inc.com>.

I'm not sure if this has been mentioned or not, but in machine learning with text-based documents, the first stage is often a glorified word count, except that much of the time it is done over n-grams. So:

Map Input:
"Hello this is a test"

Map Output:
"Hello"
"this"
"is"
"a"
"test"
"Hello" "this"
"this" "is"
"is" "a"
"a" "test"
...

You may also be extracting all kinds of other features from the text, but the tokenization/n-gram step is not that CPU intensive.
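
A minimal sketch of that map step in plain Python, not an actual Hadoop mapper. The name ngram_mapper and the max_n parameter are assumptions for illustration; splitting on whitespace stands in for whatever tokenizer a real pipeline would use, and the default of 2 only covers the unigrams and bigrams listed above.

```python
def ngram_mapper(line, max_n=2):
    """Yield every n-gram of the line, for n = 1 up to max_n.

    Hypothetical helper: whitespace tokenization is an assumption and
    each n-gram is emitted as a tuple of tokens.
    """
    tokens = line.split()
    for n in range(1, max_n + 1):
        # A line with t tokens has t - n + 1 n-grams of length n.
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


if __name__ == "__main__":
    records = list(ngram_mapper("Hello this is a test"))
    for record in records:
        print(" ".join(record))
    # 5 unigrams + 4 bigrams = 9 records from a 5-token input.
    print(len(records))  # 9
```

Each extra value of n adds nearly another full copy of the input to the map output, so the output outgrows the input quickly while the per-record work stays at simple list slicing.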

--Bobby Evans


Re: Applications creates bigger output than input?

Posted by elton sky <el...@gmail.com>.

Hello,
I am picking this topic up again, because what I am looking for is something
that is not CPU bound. Augmenting data for ETL and generating an index are
good examples. Neither of them requires much CPU time on the map side. The
main bottleneck for them is shuffle and merge.

Market basket analysis is CPU intensive in the map phase, because it samples
all possible combinations of items.

I am still looking for more applications that create bigger output than
input and are not CPU bound.
Any further ideas? I would appreciate them.
