Posted to user@hadoop.apache.org by Sigurd Spieckermann <si...@gmail.com> on 2012/09/25 10:32:50 UTC

Join-package combiner number of input and output records the same

Hi guys,

I'm experiencing some strange behavior when I use the Hadoop join-package.
After running a job, the result statistics show that my combiner has an
input of 100 records and an output of 100 records. From the task I'm
running and the way it's implemented, I know that each key appears multiple
times and the values should be combinable before being passed to the
reducer. I'm running my tests in pseudo-distributed mode with one or two
map tasks. Using the debugger, I noticed that each key-value pair is
processed by the combiner individually, so there's never a list of values
passed into the combiner that it could aggregate. Can anyone think of a
reason for this undesired behavior?

Thanks
Sigurd
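
For context, a minimal sketch of how a join-package job with the reducer
doubling as combiner is typically wired up (old mapred API); the Text
key/value types, the JoinMapper/JoinReducer classes and the sequence-file
inputs are assumptions, not the actual code behind the numbers above:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;
    import org.apache.hadoop.mapred.join.TupleWritable;

    public class JoinJobSketch {

      // Emits one record per joined tuple; the key is the join key.
      public static class JoinMapper extends MapReduceBase
          implements Mapper<Text, TupleWritable, Text, Text> {
        @Override
        public void map(Text key, TupleWritable value,
            OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          out.collect(key, new Text(value.toString()));
        }
      }

      // Used both as combiner and reducer: concatenates all values per key.
      public static class JoinReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          StringBuilder sb = new StringBuilder();
          while (values.hasNext()) {
            sb.append(values.next().toString()).append(' ');
          }
          out.collect(key, new Text(sb.toString().trim()));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinJobSketch.class);
        conf.setJobName("join-package-sketch");

        // Inner join over two sequence-file datasets; each composite split
        // pairs up the corresponding splits of both inputs.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path(args[0]), new Path(args[1])));

        conf.setMapperClass(JoinMapper.class);
        conf.setCombinerClass(JoinReducer.class);   // same class as the reducer
        conf.setReducerClass(JoinReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        JobClient.runJob(conf);
      }
    }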

Re: Join-package combiner number of input and output records the same

Posted by Bertrand Dechoux <de...@gmail.com>.
Hi,

Could you provide an approximation of the data volumes you are dealing with?
If I understand correctly, your map tasks produce almost nothing (it
depends on the size of your key/value, I guess).
My questions are 1) is the combiner really useful in your context? 2) is
the reducer really useful in your context?

Back to your problem: when a map task is done, its sorted output can
already be sent to the reducer. So waiting for a node to finish all of its
map tasks may not be the best solution. Furthermore, you would have to know
how many tasks the node will get, knowing which one is the last is not
obvious (unless you wait for absolutely all map tasks to finish...), and if
the node is lost, all of that work would have to be redone... So a bigger,
cross-task combiner might not be the best solution, at least in the general
case.

A solution might be to skip the combiner and the reducer, put the output of
the map tasks into a datastore, and work on that. But it will depend on
your context, of course.
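
A minimal sketch of such a map-only variant, assuming the old mapred API
and sequence-file input/output; IdentityMapper is only a stand-in for the
real map logic, and writing into an external datastore would happen inside
the mapper or a custom OutputFormat:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class MapOnlySketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlySketch.class);
        conf.setJobName("map-only-sketch");

        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(IdentityMapper.class);
        // Zero reduce tasks: no sort/shuffle and no combiner; each map
        // task's output is written directly by the OutputFormat (or pushed
        // to an external store from within the mapper).
        conf.setNumReduceTasks(0);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
      }
    }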

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:34 PM, Sigurd Spieckermann <
sigurd.spieckermann@gmail.com> wrote:

> I'm not doing a conventional join, but in my case one split/file consists
> of only one key-value pair. I'm not using default mapper/reducer
> implementations. I'm guessing the problem is that a combiner is only
> applied to the output of a map task which is an instance of the mapper
> class, but one map task processes one split and since I only have one
> key-value pair per split, there is nothing to combine. What I would need is
> a combiner across multiple map tasks or a way to treat all splits of a
> datanode as one, hence there would only be one map task. Is there a way to
> do something like that? Reusing the JVM hasn't worked in my tests.
>
> On 25.09.2012 at 15:40, Björn-Elmar Macek wrote:
>
>> Ups, sorry. You are using standart implementations? I dont know whats
>> happening then. Sorry. But the fact, that your inputsize equals your
>> outputsize in a "join" process reminded me too much of my own problems.
>> Sorry for confusion, i may have caused.
>>
>> Best,
>> On 25.09.2012 at 15:32, Björn-Elmar Macek <macek@cs.uni-kassel.de> wrote:
>>
>>  Hi,
>>>
>>> i had this problem once too. Did you properly overwrite the reduce
>>> method with the @override annotation?
>>> Does your reduce method use OutputCollector or Context for gathering
>>> outputs? If you are using current version, it has to be Context.
>>>
>>> The thing is: if you do NOT override the standart reduce function
>>> (identity) is used and this results ofc in the same number of tuples
>>> as you read as input.
>>>
>>> Good luck!
>>> Elmar
>>>
>>> On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:
>>>
>>>  I think I have tracked down the problem to the point that each split
>>>> only contains one big key-value pair and a combiner is connected to a
>>>> map task. Please correct me if I'm wrong, but I assume each map task
>>>> takes one split and the combiner operates only on the key-value pairs
>>>> within one split. That's why the combiner has no effect in my case.
>>>> Is there a way to combine the mapper outputs of multiple splits
>>>> before they are sent off to the reducer?
>>>>
>>>> 2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>>
>>>>
>>>>     Maybe one more note: the combiner and the reducer class are the
>>>>     same and in the reduce-phase the values get aggregated correctly.
>>>>     Why is this not happening in the combiner-phase?
>>>>
>>>>
>>>>     2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>>
>>>>
>>>>         Hi guys,
>>>>
>>>>         I'm experiencing a strange behavior when I use the Hadoop
>>>>         join-package. After running a job the result statistics show
>>>>         that my combiner has an input of 100 records and an output of
>>>>         100 records. From the task I'm running and the way it's
>>>>         implemented, I know that each key appears multiple times and
>>>>         the values should be combinable before getting passed to the
>>>>         reducer. I'm running my tests in pseudo-distributed mode with
>>>>         one or two map tasks. From using the debugger, I noticed that
>>>>         each key-value pair is processed by a combiner individually
>>>>         so there's actually no list passed into the combiner that it
>>>>         could aggregate. Can anyone think of a reason that causes
>>>>         this undesired behavior?
>>>>
>>>>         Thanks
>>>>         Sigurd
>>>>
>>>>
>>>>
>>>>
>>>
>>


-- 
Bertrand Dechoux

Re: Join-package combiner number of input and output records the same

Posted by Sigurd Spieckermann <si...@gmail.com>.
I'm not doing a conventional join; in my case, one split/file consists of
only one key-value pair. I'm not using the default mapper/reducer
implementations. I'm guessing the problem is that a combiner is only
applied to the output of a single map task, which is an instance of the
mapper class; but one map task processes one split, and since I only have
one key-value pair per split, there is nothing to combine. What I would
need is a combiner across multiple map tasks, or a way to treat all splits
on a datanode as one so that there would only be one map task. Is there a
way to do something like that? Reusing the JVM hasn't worked in my tests.
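
"Reusing the JVM" presumably refers to the mapred.job.reuse.jvm.num.tasks
setting sketched below. Note that even with JVM reuse, each map task still
sorts, spills and ships its own output, so the combiner still only sees the
single record that one task produced:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf(JvmReuseSketch.class);

        // -1 = reuse the task JVM for an unlimited number of this job's tasks;
        // equivalent to conf.set("mapred.job.reuse.jvm.num.tasks", "-1").
        conf.setNumTasksToExecutePerJvm(-1);

        // Reuse only saves JVM start-up time: every map task still produces
        // its own sorted output, so a combiner still runs per task and, with
        // one key-value pair per split, still has nothing to merge.
      }
    }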

On 25.09.2012 at 15:40, Björn-Elmar Macek wrote:
> Ups, sorry. You are using standart implementations? I dont know whats
> happening then. Sorry. But the fact, that your inputsize equals your
> outputsize in a "join" process reminded me too much of my own problems.
> Sorry for confusion, i may have caused.
>
> Best,
> On 25.09.2012 at 15:32, Björn-Elmar Macek <macek@cs.uni-kassel.de> wrote:
>
>> Hi,
>>
>> i had this problem once too. Did you properly overwrite the reduce
>> method with the @override annotation?
>> Does your reduce method use OutputCollector or Context for gathering
>> outputs? If you are using current version, it has to be Context.
>>
>> The thing is: if you do NOT override the standart reduce function
>> (identity) is used and this results ofc in the same number of tuples
>> as you read as input.
>>
>> Good luck!
>> Elmar
>>
>> On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:
>>
>>> I think I have tracked down the problem to the point that each split
>>> only contains one big key-value pair and a combiner is connected to a
>>> map task. Please correct me if I'm wrong, but I assume each map task
>>> takes one split and the combiner operates only on the key-value pairs
>>> within one split. That's why the combiner has no effect in my case.
>>> Is there a way to combine the mapper outputs of multiple splits
>>> before they are sent off to the reducer?
>>>
>>> 2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com
>>> <ma...@gmail.com>>
>>>
>>>     Maybe one more note: the combiner and the reducer class are the
>>>     same and in the reduce-phase the values get aggregated correctly.
>>>     Why is this not happening in the combiner-phase?
>>>
>>>
>>>     2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com
>>>     <ma...@gmail.com>>
>>>
>>>         Hi guys,
>>>
>>>         I'm experiencing a strange behavior when I use the Hadoop
>>>         join-package. After running a job the result statistics show
>>>         that my combiner has an input of 100 records and an output of
>>>         100 records. From the task I'm running and the way it's
>>>         implemented, I know that each key appears multiple times and
>>>         the values should be combinable before getting passed to the
>>>         reducer. I'm running my tests in pseudo-distributed mode with
>>>         one or two map tasks. From using the debugger, I noticed that
>>>         each key-value pair is processed by a combiner individually
>>>         so there's actually no list passed into the combiner that it
>>>         could aggregate. Can anyone think of a reason that causes
>>>         this undesired behavior?
>>>
>>>         Thanks
>>>         Sigurd
>>>
>>>
>>>
>>
>

Re: Join-package combiner number of input and output records the same

Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Oops, sorry. You are using the standard implementations? I don't know what's happening then, sorry. But the fact that your input size equals your output size in a "join" process reminded me too much of my own problems. Sorry for any confusion I may have caused.

Best,
On 25.09.2012 at 15:32, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:

> Hi,
> 
> i had this problem once too. Did you properly overwrite the reduce method with the @override annotation?
> Does your reduce method use OutputCollector or Context for gathering outputs? If you are using current version, it has to be Context.
> 
> The thing is: if you do NOT override the standart reduce function (identity) is used and this results ofc in the same number of tuples as you read as input.
> 
> Good luck!
> Elmar
> 
> On 25.09.2012 at 11:57, Sigurd Spieckermann <si...@gmail.com> wrote:
> 
>> I think I have tracked down the problem to the point that each split only contains one big key-value pair and a combiner is connected to a map task. Please correct me if I'm wrong, but I assume each map task takes one split and the combiner operates only on the key-value pairs within one split. That's why the combiner has no effect in my case. Is there a way to combine the mapper outputs of multiple splits before they are sent off to the reducer?
>> 
>> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
>> Maybe one more note: the combiner and the reducer class are the same and in the reduce-phase the values get aggregated correctly. Why is this not happening in the combiner-phase?
>> 
>> 
>> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
>> Hi guys,
>> 
>> I'm experiencing a strange behavior when I use the Hadoop join-package. After running a job the result statistics show that my combiner has an input of 100 records and an output of 100 records. From the task I'm running and the way it's implemented, I know that each key appears multiple times and the values should be combinable before getting passed to the reducer. I'm running my tests in pseudo-distributed mode with one or two map tasks. From using the debugger, I noticed that each key-value pair is processed by a combiner individually so there's actually no list passed into the combiner that it could aggregate. Can anyone think of a reason that causes this undesired behavior?
>> 
>> Thanks
>> Sigurd
>> 
>> 
> 


Re: Join-package combiner number of input and output records the same

Posted by Bertrand Dechoux <de...@gmail.com>.
Out of curiosity: did you change the partitioner or the comparators? And
how did you implement the equals and hashCode methods of your objects?
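
For anyone tracing the same symptom, a minimal sketch of a key class whose
compareTo, equals and hashCode agree; the class and field names are made
up. The shuffle sorts and groups by the comparator, while the default
HashPartitioner routes records by hashCode, so the three must be
consistent:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class DatasetKey implements WritableComparable<DatasetKey> {
      private long id;   // hypothetical key field

      public DatasetKey() {}                 // required no-arg constructor
      public DatasetKey(long id) { this.id = id; }

      @Override
      public void write(DataOutput out) throws IOException { out.writeLong(id); }

      @Override
      public void readFields(DataInput in) throws IOException { id = in.readLong(); }

      // Sorting and grouping in the shuffle are driven by this comparison.
      @Override
      public int compareTo(DatasetKey other) {
        return id < other.id ? -1 : (id == other.id ? 0 : 1);
      }

      // The default HashPartitioner routes records by hashCode, so it must
      // agree with equals/compareTo or identical keys can land in
      // different partitions.
      @Override
      public int hashCode() { return (int) (id ^ (id >>> 32)); }

      @Override
      public boolean equals(Object o) {
        return o instanceof DatasetKey && ((DatasetKey) o).id == id;
      }
    }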

Regards

Bertrand

On Tue, Sep 25, 2012 at 3:32 PM, Björn-Elmar Macek
<ma...@cs.uni-kassel.de> wrote:

> Hi,
>
> i had this problem once too. Did you properly overwrite the reduce method
> with the @override annotation?
> Does your reduce method use OutputCollector or Context for gathering
> outputs? If you are using current version, it has to be Context.
>
> The thing is: if you do NOT override the standart reduce function
> (identity) is used and this results ofc in the same number of tuples as you
> read as input.
>
> Good luck!
> Elmar
>
> On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:
>
> I think I have tracked down the problem to the point that each split only
> contains one big key-value pair and a combiner is connected to a map task.
> Please correct me if I'm wrong, but I assume each map task takes one split
> and the combiner operates only on the key-value pairs within one split.
> That's why the combiner has no effect in my case. Is there a way to combine
> the mapper outputs of multiple splits before they are sent off to the
> reducer?
>
> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
>
>> Maybe one more note: the combiner and the reducer class are the same and
>> in the reduce-phase the values get aggregated correctly. Why is this not
>> happening in the combiner-phase?
>>
>>
>> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
>>
>>> Hi guys,
>>>
>>> I'm experiencing a strange behavior when I use the Hadoop join-package.
>>> After running a job the result statistics show that my combiner has an
>>> input of 100 records and an output of 100 records. From the task I'm
>>> running and the way it's implemented, I know that each key appears multiple
>>> times and the values should be combinable before getting passed to the
>>> reducer. I'm running my tests in pseudo-distributed mode with one or two
>>> map tasks. From using the debugger, I noticed that each key-value pair is
>>> processed by a combiner individually so there's actually no list passed
>>> into the combiner that it could aggregate. Can anyone think of a reason
>>> that causes this undesired behavior?
>>>
>>> Thanks
>>> Sigurd
>>>
>>
>>
>
>


-- 
Bertrand Dechoux

Re: Join-package combiner number of input and output records the same

Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Hi,

I had this problem once too. Did you properly override the reduce method with the @Override annotation?
Does your reduce method use OutputCollector or Context for gathering outputs? If you are using the current (new) API, it has to be Context.

The thing is: if you do NOT override it, the standard reduce function (identity) is used, and that of course results in the same number of output tuples as input tuples.
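
For illustration, a minimal sketch of a new-API reducer whose reduce method really overrides the base class version; the SumReducer name and the Text/LongWritable types are placeholders. With a mismatched signature (Iterator instead of Iterable, or OutputCollector instead of Context), the @Override annotation fails to compile, which is exactly what makes it a useful check:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

      // Must match the base class signature exactly (Iterable, not Iterator;
      // Context, not OutputCollector), otherwise the identity reduce runs
      // and the output record count equals the input record count.
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
          sum += value.get();
        }
        context.write(key, new LongWritable(sum));
      }
    }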

Good luck!
Elmar

On 25.09.2012 at 11:57, Sigurd Spieckermann <si...@gmail.com> wrote:

> I think I have tracked down the problem to the point that each split only contains one big key-value pair and a combiner is connected to a map task. Please correct me if I'm wrong, but I assume each map task takes one split and the combiner operates only on the key-value pairs within one split. That's why the combiner has no effect in my case. Is there a way to combine the mapper outputs of multiple splits before they are sent off to the reducer?
> 
> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
> Maybe one more note: the combiner and the reducer class are the same and in the reduce-phase the values get aggregated correctly. Why is this not happening in the combiner-phase?
> 
> 
> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
> Hi guys,
> 
> I'm experiencing a strange behavior when I use the Hadoop join-package. After running a job the result statistics show that my combiner has an input of 100 records and an output of 100 records. From the task I'm running and the way it's implemented, I know that each key appears multiple times and the values should be combinable before getting passed to the reducer. I'm running my tests in pseudo-distributed mode with one or two map tasks. From using the debugger, I noticed that each key-value pair is processed by a combiner individually so there's actually no list passed into the combiner that it could aggregate. Can anyone think of a reason that causes this undesired behavior?
> 
> Thanks
> Sigurd
> 
> 


Re: Join-package combiner number of input and output records the same

Posted by Sigurd Spieckermann <si...@gmail.com>.
I think I have tracked down the problem to the point that each split only
contains one big key-value pair and a combiner is connected to a map task.
Please correct me if I'm wrong, but I assume each map task takes one split
and the combiner operates only on the key-value pairs within one split.
That's why the combiner has no effect in my case. Is there a way to combine
the mapper outputs of multiple splits before they are sent off to the
reducer?
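
One pattern that is sometimes used when each of many small files yields only a single record is CombineFileInputFormat, which packs several files into one split so that a single map task (and therefore its combiner) sees records from all of them. Whether this can be reconciled with the join package's composite input format is unclear, so the following is only a sketch under that assumption (new mapreduce API, plain text inputs, hypothetical class names):

// Hedged sketch: pack several small files into one split so a single map
// task's combiner sees records from all of them. Class names are hypothetical.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombinedTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    public CombinedTextInputFormat() {
        // Cap the size of a packed split so a single map task does not become huge.
        setMaxSplitSize(64 * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader creates one PerFileLineReader per packed file.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, PerFileLineReader.class);
    }

    // Delegates to a plain LineRecordReader for the index-th file of the combined split.
    public static class PerFileLineReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final int index;

        public PerFileLineReader(CombineFileSplit split, TaskAttemptContext context,
                Integer index) {
            this.index = index;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            CombineFileSplit combined = (CombineFileSplit) split;
            FileSplit fileSplit = new FileSplit(combined.getPath(index),
                    combined.getOffset(index), combined.getLength(index),
                    combined.getLocations());
            delegate.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

A job would then use this class as its input format (e.g. job.setInputFormatClass(CombinedTextInputFormat.class)) in place of the composite join format, which is precisely the part that may not be applicable here.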

2012/9/25 Sigurd Spieckermann <si...@gmail.com>

> Maybe one more note: the combiner and the reducer class are the same and
> in the reduce-phase the values get aggregated correctly. Why is this not
> happening in the combiner-phase?
>
>
> 2012/9/25 Sigurd Spieckermann <si...@gmail.com>
>
>> Hi guys,
>>
>> I'm experiencing a strange behavior when I use the Hadoop join-package.
>> After running a job the result statistics show that my combiner has an
>> input of 100 records and an output of 100 records. From the task I'm
>> running and the way it's implemented, I know that each key appears multiple
>> times and the values should be combinable before getting passed to the
>> reducer. I'm running my tests in pseudo-distributed mode with one or two
>> map tasks. From using the debugger, I noticed that each key-value pair is
>> processed by a combiner individually so there's actually no list passed
>> into the combiner that it could aggregate. Can anyone think of a reason
>> that causes this undesired behavior?
>>
>> Thanks
>> Sigurd
>>
>
>

Re: Join-package combiner number of input and output records the same

Posted by Sigurd Spieckermann <si...@gmail.com>.
Maybe one more note: the combiner and the reducer class are the same and in
the reduce-phase the values get aggregated correctly. Why is this not
happening in the combiner-phase?
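
To make this concrete, a driver of the kind described here might be wired up roughly as follows (old mapred API, which is what the join package belongs to; class names, key/value types and paths are hypothetical). The class passed to setCombinerClass runs separately inside every map task, over that task's own output only, while the same class passed to setReducerClass later processes the merged output of all map tasks:

// Hedged sketch of a join job where one aggregating class serves as both
// combiner and reducer. All names, types and paths are hypothetical.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class JoinJobDriver {

    // Hypothetical mapper: emits one numeric value per joined key.
    public static class JoinMapper extends MapReduceBase
            implements Mapper<Text, TupleWritable, Text, DoubleWritable> {
        @Override
        public void map(Text key, TupleWritable tuple,
                OutputCollector<Text, DoubleWritable> out, Reporter reporter)
                throws IOException {
            // Toy logic: multiply the two joined values (assumed DoubleWritable).
            double a = ((DoubleWritable) tuple.get(0)).get();
            double b = ((DoubleWritable) tuple.get(1)).get();
            out.collect(key, new DoubleWritable(a * b));
        }
    }

    // Used both as combiner and as reducer: sums the values per key.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterator<DoubleWritable> values,
                OutputCollector<Text, DoubleWritable> out, Reporter reporter)
                throws IOException {
            double sum = 0.0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinJobDriver.class);
        conf.setJobName("join-with-combiner");

        // Inner join over two sequence-file inputs (paths are hypothetical).
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", SequenceFileInputFormat.class,
                new Path("/data/a"), new Path("/data/b")));

        conf.setMapperClass(JoinMapper.class);
        conf.setCombinerClass(SumReducer.class); // invoked per map task, on that task's output only
        conf.setReducerClass(SumReducer.class);  // invoked once per key over all map outputs

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DoubleWritable.class);
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

        JobClient.runJob(conf);
    }
}

With this wiring, the job counters "Combine input records" and "Combine output records" only differ when a single map task emits the same key more than once; the reducer, by contrast, always sees all values for a key, which is why the reduce phase aggregates correctly even when the combiner appears to do nothing.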

2012/9/25 Sigurd Spieckermann <si...@gmail.com>

> Hi guys,
>
> I'm experiencing a strange behavior when I use the Hadoop join-package.
> After running a job the result statistics show that my combiner has an
> input of 100 records and an output of 100 records. From the task I'm
> running and the way it's implemented, I know that each key appears multiple
> times and the values should be combinable before getting passed to the
> reducer. I'm running my tests in pseudo-distributed mode with one or two
> map tasks. From using the debugger, I noticed that each key-value pair is
> processed by a combiner individually so there's actually no list passed
> into the combiner that it could aggregate. Can anyone think of a reason
> that causes this undesired behavior?
>
> Thanks
> Sigurd
>
