Posted to mapreduce-user@hadoop.apache.org by Peter Ruch <ru...@gmail.com> on 2015/05/11 19:08:50 UTC

Filtering by value in Reducer

Hi,

I am currently playing around with Hadoop and have some problems when
trying to filter in the Reducer.

I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
some additional functionality
and added the possibility to filter by the specific value of each key -
e.g. only output the key-value pairs where [[ value > threshold ]].

Filtering Code in Reducer
#####################################

int sum = 0;
for (IntWritable val : values) {
     sum += val.get();
}
// Emit the pair only if its total exceeds the configured threshold.
if ( sum > threshold ) {
     result.set(sum);
     context.write(key, result);
}

#####################################

For a threshold smaller than any value, the above code works as expected and
the output contains all key-value pairs.
If I increase the threshold to 1, some pairs are missing from the output
although their values are larger than the threshold.

I tried to work out the error myself, but I could not get it to work as
intended. I use the exact Tutorial setup with Oracle JDK 8
on a CentOS 7 machine.

As far as I understand the respective Iterable<...>  in the Reducer already
contains all the observed values for a specific key.
Why is it possible that I am missing some of these key-value pairs then? It
only fails in very few cases. The input file is pretty large (250 MB),
so I also tried to increase the memory for the map and reduce steps,
but it did not help (I tried many different things without success).
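
A possible explanation, not confirmed anywhere in this thread: the tutorial's
WordCount v1.0 driver registers the reducer class as the combiner as well
(job.setCombinerClass(IntSumReducer.class)). If the extended job keeps that
line, the threshold filter also runs on each map task's partial sums, so a key
whose per-split count is at or below the threshold is dropped before the
reducer ever sees it. The sketch below simulates that effect in plain Java
without Hadoop; the class name, method names, and sample data are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerFilterDemo {

    // Convenience factory for one (word, count) pair.
    static Map.Entry<String, Integer> pair(String word, int count) {
        return new SimpleEntry<>(word, count);
    }

    // Sums the counts per key and keeps only sums above the threshold,
    // mirroring the reducer snippet from the question.
    static Map<String, Integer> sumAndFilter(List<Map.Entry<String, Integer>> pairs, int threshold) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        sums.values().removeIf(sum -> sum <= threshold);
        return sums;
    }

    // Applies the filter once per split (the combiner role) and then again
    // to the surviving partial sums (the reducer role).
    static Map<String, Integer> runWithFilteringCombiner(
            List<List<Map.Entry<String, Integer>>> splits, int threshold) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> split : splits) {
            shuffled.addAll(sumAndFilter(split, threshold).entrySet());
        }
        return sumAndFilter(shuffled, threshold);
    }

    // Hypothetical mapper output for two input splits:
    // "hadoop" occurs once per split, "map" twice in the first and once in the second.
    static List<List<Map.Entry<String, Integer>>> sampleSplits() {
        List<Map.Entry<String, Integer>> split1 =
                Arrays.asList(pair("hadoop", 1), pair("map", 1), pair("map", 1));
        List<Map.Entry<String, Integer>> split2 =
                Arrays.asList(pair("hadoop", 1), pair("map", 1));
        return Arrays.asList(split1, split2);
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> splits = sampleSplits();

        // Reference result: filter applied only once, over all pairs.
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> split : splits) {
            all.addAll(split);
        }
        System.out.println("filter in reducer only:   " + sumAndFilter(all, 1));

        // Same data, but the filter also runs per split before the final sum.
        System.out.println("filter in combiner too:   " + runWithFilteringCombiner(splits, 1));

        // With threshold -1 nothing is filtered at either stage.
        System.out.println("threshold -1 with combiner: " + runWithFilteringCombiner(splits, -1));
    }
}
```

With a threshold of 1, the simulated job loses "hadoop" entirely (true total 2)
and reports "map" as 2 instead of 3, while a threshold of -1 returns every
pair. If this is indeed the cause, a probable fix is to pass an unfiltered
summing class to job.setCombinerClass(...) and to use the filtering class only
in job.setReducerClass(...).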

Maybe someone has already experienced similar problems or is more experienced
than I am.


Thank you,

Peter

Re: Re: Re: Re: Filtering by value in Reducer

Posted by Drake민영근 <dr...@nexr.com>.

Hi,

Did you try MapReduce local mode with smaller input data? Writing a test
case with MRUnit is also very helpful for debugging.
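
In the spirit of testing on small input, the reducer's decision logic can be
pulled into a plain method and exercised with a few hand-written value lists
before involving Hadoop at all. This is a minimal sketch and not MRUnit;
ThresholdFilterCheck and sumIfAbove are hypothetical names, and the value
lists are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

class ThresholdFilterCheck {

    // Same decision as the reducer body: sum all values for one key and
    // return the sum only if it is strictly greater than the threshold.
    static OptionalInt sumIfAbove(List<Integer> values, int threshold) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum > threshold ? OptionalInt.of(sum) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // A key seen three times passes a threshold of 1.
        System.out.println(sumIfAbove(Arrays.asList(1, 1, 1), 1));
        // A key seen once does not pass a threshold of 1.
        System.out.println(sumIfAbove(Arrays.asList(1), 1));
        // With the default threshold of -1 every key passes.
        System.out.println(sumIfAbove(Arrays.asList(1), -1));
    }
}
```

If checks like these pass, the comparison itself is probably fine and the
missing pairs are more likely caused by how the job is wired together than by
the reduce logic.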

Thanks.

Drake 민영근 Ph.D
kt NexR

On Tue, May 12, 2015 at 11:23 PM, Peter Ruch <ru...@gmail.com>
wrote:

>  Hi,
>
> No, I did not create any custom logs, I was only looking through the
> "standard" logs.
> I just started out with Hadoop and did not think of explicitly logging
> that part of the code,
> as I thought that I am simply missing a small detail that someone of you
> might spot.
>
> But I will definitely look into the custom logging and post my findings.
>
> @ Shahab and Drake: Thank you very much for your help.
>
>
> Best,
> Peter
>
>
>
> On 12.05.2015 14:57, Shahab Yunus wrote:
>
> Have you tried explicitly printing or logging in your reducer around the
> code that compares and then outputs the values? Maybe that will give you a
> clue about what is happening. Also check the threshold value that you get in
> the reducer and whether it is what you have set (in the case where you
> set it to a value greater than -1).
>
> You can also try to use the compare method for comparing IntWritables, though
> I doubt that would make any difference.
>
> Shahab
> On May 12, 2015 8:17 AM, "Peter Ruch" <ru...@gmail.com> wrote:
>
>>  Hi,
>>
>> I already skimmed through the logs but I could not find anything special.
>>
>> I am just really confused why I am having this problem.
>>
>> If the Iterable<...> for a specific key contains all of the observed
>> values - and it seems to do so
>> otherwise the program wouldn't work correctly in the standard case with
>> [[ threshold = -1 ]] -
>> it should also work when I only write the key-value pairs to the output
>> file that suffice the condition [[ sum > threshold ]].
>>
>> Did I miss something? Maybe I have to handle these cases in a specific
>> way, but I did not find anything about that online.
>>
>>
>> Thank you for your help,
>>
>> Peter
>>
>>
>>
>> On 12.05.2015 12:35, Drake민영근 wrote:
>>
>> Hi, Peter
>>
>>  The missing records, are they just gone without any logs? How about your
>> reduce task logs?
>>
>>  Thanks
>>
>>   Drake 민영근 Ph.D
>> kt NexR
>>
>> On Tue, May 12, 2015 at 5:18 AM, Peter Ruch <ru...@gmail.com>
>> wrote:
>>
>>>  Hello,
>>>
>>> sum and threshold are both integers.
>>> For the threshold variable, I first add a new resource to the
>>> configuration - conf.addResource( ... );
>>>
>>> Later I get the threshold value from the configuration.
>>>
>>> Code
>>> #####################################
>>>
>>> private int threshold;
>>>
>>> public void setup( Context context ) {
>>>
>>>           Configuration conf = context.getConfiguration();
>>>           threshold = conf.getInt( "threshold", -1 );
>>>
>>> }
>>>
>>> #####################################
>>>
>>>
>>> Best,
>>> Peter
>>>
>>>
>>>
>>> On 11.05.2015 19:26, Shahab Yunus wrote:
>>>
>>> What is the type of the threshold variable? sum I believe is a Java int.
>>>
>>>  Regards,
>>> Shahab
>>>
>>> On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <ru...@gmail.com>
>>> wrote:
>>>
>>>>   Hi,
>>>>
>>>>  I am currently playing around with Hadoop and have some problems when
>>>> trying to filter in the Reducer.
>>>>
>>>> I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial
>>>> with some additional functionality
>>>> and added the possibility to filter by the specific value of each key -
>>>> e.g. only output the key-value pairs where [[ value > threshold ]].
>>>>
>>>>  Filtering Code in Reducer
>>>>  #####################################
>>>>
>>>>  for (IntWritable val : values) {
>>>>      sum += val.get();
>>>> }
>>>> if ( sum > threshold ) {
>>>>      result.set(sum);
>>>>      context.write(key, result);
>>>> }
>>>>
>>>> #####################################
>>>>
>>>>  For threshold smaller any value the above code works as expected and
>>>> the output contains all key-value pairs.
>>>>  If I increase the threshold to 1 some pairs are missing in the output
>>>> although the respective value would be larger than the threshold.
>>>>
>>>>  I tried to work out the error myself, but I could not get it to work
>>>> as intended. I use the exact Tutorial setup with Oracle JDK 8
>>>>  on a CentOS 7 machine.
>>>>
>>>>  As far as I understand the respective Iterable<...>  in the Reducer
>>>> already contains all the observed values for a specific key.
>>>>  Why is it possible that I am missing some of these key-value pairs
>>>> then? It only fails in very few cases. The input file is pretty large - 250
>>>> MB -
>>>>  so I also tried to increase the memory for the mapping and reduction
>>>> steps but it did not help ( tried a lot of different stuff without success )
>>>>
>>>>  Maybe someone already experienced similar problems / is more
>>>> experienced than I am.
>>>>
>>>>
>>>>  Thank you,
>>>>
>>>>  Peter
>>>>
>>>
>>>
>>>
>>
>>
>

Re: Re: Re: Re: Filtering by value in Reducer

Posted by Peter Ruch <ru...@gmail.com>.

Hi,

No, I did not create any custom logs; I was only looking through the
"standard" logs.
I just started out with Hadoop and did not think of explicitly logging
that part of the code,
as I thought that I was simply missing a small detail that one of you
might spot.

But I will definitely look into the custom logging and post my findings.

@ Shahab and Drake: Thank you very much for your help.


Best,
Peter


On 12.05.2015 14:57, Shahab Yunus wrote:
>
> Have you tried explicitly printing or logging in your reducer around
> the code that compares and then outputs the values? Maybe that will
> give you a clue about what is happening. Also check the threshold value
> that you get in the reducer and whether it is what you have set
> (in the case where you set it to a value greater than -1).
>
> You can also try to use the compare method for comparing IntWritables,
> though I doubt that would make any difference.
>
> Shahab
>
> On May 12, 2015 8:17 AM, "Peter Ruch" <rutschifengga@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Hi,
>
>     I already skimmed through the logs but I could not find anything
>     special.
>
>     I am just really confused why I am having this problem.
>
>     If the Iterable<...> for a specific key contains all of the
>     observed values - and it seems to do so
>     otherwise the program wouldn't work correctly in the standard case
>     with [[ threshold = -1 ]] -
>     it should also work when I only write the key-value pairs to the
>     output file that suffice the condition [[ sum > threshold ]].
>
>     Did I miss something? Maybe I have to handle these cases in a
>     specific way, but I did not find anything about that online.
>
>
>     Thank you for your help,
>
>     Peter
>
>
>
>     On 12.05.2015 12:35, Drake민영근 wrote:
>>     Hi, Peter
>>
>>     The missing records, are they just gone without any logs? How
>>     about your reduce task logs?
>>
>>     Thanks
>>
>>     Drake 민영근 Ph.D
>>     kt NexR
>>
>>     On Tue, May 12, 2015 at 5:18 AM, Peter Ruch
>>     <rutschifengga@gmail.com <ma...@gmail.com>> wrote:
>>
>>         Hello,
>>
>>         sum and threshold are both Integers.
>>         for the threshold variable I first add a new resource to the
>>         configuration - conf.addResource( ... );
>>
>>         later I get the threshold value from the configuration.
>>
>>         Code
>>         #####################################
>>
>>         private int threshold;
>>
>>         public void setup( Context context ) {
>>
>>                   Configuration conf = context.getConfiguration();
>>                   threshold = conf.getInt( "threshold", -1 );
>>
>>         }
>>
>>         #####################################
>>
>>
>>         Best,
>>         Peter
>>
>>
>>
>>         On 11.05.2015 19:26, Shahab Yunus wrote:
>>>         What is the type of the threshold variable? sum I believe is
>>>         a Java int.
>>>
>>>         Regards,
>>>         Shahab
>>>
>>>         On Mon, May 11, 2015 at 1:08 PM, Peter Ruch
>>>         <rutschifengga@gmail.com <ma...@gmail.com>>
>>>         wrote:
>>>
>>>             Hi,
>>>
>>>             I am currently playing around with Hadoop and have some
>>>             problems when trying to filter in the Reducer.
>>>
>>>             I extended the WordCount v1.0 example from the 2.7
>>>             MapReduce Tutorial with some additional functionality
>>>             and added the possibility to filter by the specific
>>>             value of each key - e.g. only output the key-value pairs
>>>             where [[ value > threshold ]].
>>>
>>>             Filtering Code in Reducer
>>>             #####################################
>>>
>>>             for (IntWritable val : values) {
>>>                  sum += val.get();
>>>             }
>>>             if ( sum > threshold ) {
>>>                  result.set(sum);
>>>                  context.write(key, result);
>>>             }
>>>
>>>             #####################################
>>>
>>>             For threshold smaller any value the above code works as
>>>             expected and the output contains all key-value pairs.
>>>             If I increase the threshold to 1 some pairs are missing
>>>             in the output although the respective value would be
>>>             larger than the threshold.
>>>
>>>             I tried to work out the error myself, but I could not
>>>             get it to work as intended. I use the exact Tutorial
>>>             setup with Oracle JDK 8
>>>             on a CentOS 7 machine.
>>>
>>>             As far as I understand the respective Iterable<...>  in
>>>             the Reducer already contains all the observed values for
>>>             a specific key.
>>>             Why is it possible that I am missing some of these
>>>             key-value pairs then? It only fails in very few cases.
>>>             The input file is pretty large - 250 MB -
>>>             so I also tried to increase the memory for the mapping
>>>             and reduction steps but it did not help ( tried a lot of
>>>             different stuff without success )
>>>
>>>             Maybe someone already experienced similar problems / is
>>>             more experienced than I am.
>>>
>>>
>>>             Thank you,
>>>
>>>             Peter
>>>
>>>
>>
>>
>

Re: Re: Re: Re: Filtering by value in Reducer

Posted by Peter Ruch <ru...@gmail.com>.

Hi,

No, I did not create any custom logs, I was only looking through the 
"standard" logs.
I just started out with Hadoop and did not think of explicitly logging 
that part of the code,
as I thought that I am simply missing a small detail that someone of you 
might spot.

But I will definitely look into the custom logging and post my findings.

@ Shahab and Drake: Thank you very much for your help.


Best,
Peter


On 12.05.2015 14:57, Shahab Yunus wrote:
>
> Have you tried explicitly printing or logging in you reducer around 
> the code that compares and then outputs the values? Maybe that will 
> give you a clue that what is happening? Debug the threshold value that 
> you get in the reducer and whether that is what you have set or not 
> (in case of when you set it to greater than -1)?
>
> You can also try to use compare method for comparing IntWritables 
> though I doubt that would make any difference.
>
> Shahab
>
> On May 12, 2015 8:17 AM, "Peter Ruch" <rutschifengga@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Hi,
>
>     I already skimmed through the logs but I could not find anything
>     special.
>
>     I am just really confused why I am having this problem.
>
>     If the Iterable<...> for a specific key contains all of the
>     observed values - and it seems to do so
>     otherwise the program wouldn't work correctly in the standard case
>     with [[ threshold = -1 ]] -
>     it should also work when I only write the key-value pairs to the
>     output file that suffice the condition [[ sum > threshold ]].
>
>     Did I miss something? Maybe I have to handle these cases in a
>     specific way, but I did not find anything about that online.
>
>
>     Thank you for your help,
>
>     Peter
>
>
>
>     On 12.05.2015 12:35, Drake민영근 wrote:
>>     Hi, Peter
>>
>>     The missing records, they are just gone without no logs? How
>>     about your reduce tasks logs?
>>
>>     Thanks
>>
>>     Drake 민영근 Ph.D
>>     kt NexR
>>
>>     On Tue, May 12, 2015 at 5:18 AM, Peter Ruch
>>     <rutschifengga@gmail.com <ma...@gmail.com>> wrote:
>>
>>         Hello,
>>
>>         sum and threshold are both Integers.
>>         for the threshold variable I first add a new resource to the
>>         configuration - conf.addResource( ... );
>>
>>         later I get the threshold value from the configuration.
>>
>>         Code
>>         #####################################
>>
>>         private int threshold;
>>
>>         public void setup( Context context ) {
>>
>>                   Configuration conf = context.getConfiguration();
>>                   threshold = conf.getInt( "threshold", -1 );
>>
>>         }
>>
>>         #####################################
>>
>>
>>         Best,
>>         Peter
>>
>>
>>
>>         On 11.05.2015 19:26, Shahab Yunus wrote:
>>>         What is the type of the threshold variable? sum I believe is
>>>         a Java int.
>>>
>>>         Regards,
>>>         Shahab
>>>
>>>         On Mon, May 11, 2015 at 1:08 PM, Peter Ruch
>>>         <rutschifengga@gmail.com <ma...@gmail.com>>
>>>         wrote:
>>>
>>>             Hi,
>>>
>>>             I am currently playing around with Hadoop and have some
>>>             problems when trying to filter in the Reducer.
>>>
>>>             I extended the WordCount v1.0 example from the 2.7
>>>             MapReduce Tutorial with some additional functionality
>>>             and added the possibility to filter by the specific
>>>             value of each key - e.g. only output the key-value pairs
>>>             where [[ value > threshold ]].
>>>
>>>             Filtering Code in Reducer
>>>             #####################################
>>>
>>>             for (IntWritable val : values) {
>>>                  sum += val.get();
>>>             }
>>>             if ( sum > threshold ) {
>>>                  result.set(sum);
>>>                  context.write(key, result);
>>>             }
>>>
>>>             #####################################
>>>
>>>             For threshold smaller any value the above code works as
>>>             expected and the output contains all key-value pairs.
>>>             If I increase the threshold to 1 some pairs are missing
>>>             in the output although the respective value would be
>>>             larger than the threshold.
>>>
>>>             I tried to work out the error myself, but I could not
>>>             get it to work as intended. I use the exact Tutorial
>>>             setup with Oracle JDK 8
>>>             on a CentOS 7 machine.
>>>
>>>             As far as I understand the respective Iterable<...>  in
>>>             the Reducer already contains all the observed values for
>>>             a specific key.
>>>             Why is it possible that I am missing some of these
>>>             key-value pairs then? It only fails in very few cases.
>>>             The input file is pretty large - 250 MB -
>>>             so I also tried to increase the memory for the mapping
>>>             and reduction steps but it did not help ( tried a lot of
>>>             different stuff without success )
>>>
>>>             Maybe someone already experienced similar problems / is
>>>             more experienced than I am.
>>>
>>>
>>>             Thank you,
>>>
>>>             Peter
>>>
>>>
>>
>>
>

Re: Re: Re: Re: Filtering by value in Reducer

Posted by Peter Ruch <ru...@gmail.com>.

Hi,

No, I did not create any custom logs, I was only looking through the 
"standard" logs.
I just started out with Hadoop and did not think of explicitly logging 
that part of the code,
as I thought that I was simply missing a small detail that one of you
might spot.

But I will definitely look into the custom logging and post my findings.

@ Shahab and Drake: Thank you very much for your help.


Best,
Peter


On 12.05.2015 14:57, Shahab Yunus wrote:
>
> Have you tried explicitly printing or logging in your reducer around
> the code that compares and then outputs the values? Maybe that will
> give you a clue as to what is happening. Debug the threshold value that
> you get in the reducer and check whether it is what you have set
> (in the case where you set it to greater than -1).
>
> You can also try to use the compare method for comparing IntWritables
> though I doubt that would make any difference.
>
> Shahab
>
> On May 12, 2015 8:17 AM, "Peter Ruch" <rutschifengga@gmail.com> wrote:
>
>     Hi,
>
>     I already skimmed through the logs but I could not find anything
>     special.
>
>     I am just really confused why I am having this problem.
>
>     If the Iterable<...> for a specific key contains all of the
>     observed values - and it seems to do so
>     otherwise the program wouldn't work correctly in the standard case
>     with [[ threshold = -1 ]] -
>     it should also work when I only write the key-value pairs to the
>     output file that satisfy the condition [[ sum > threshold ]].
>
>     Did I miss something? Maybe I have to handle these cases in a
>     specific way, but I did not find anything about that online.
>
>
>     Thank you for your help,
>
>     Peter
>
>
>
>     On 12.05.2015 12:35, Drake민영근 wrote:
>>     Hi, Peter
>>
>>     The missing records, are they just gone without any logs? How
>>     about your reduce task logs?
>>
>>     Thanks
>>
>>     Drake 민영근 Ph.D
>>     kt NexR
>>
>>     On Tue, May 12, 2015 at 5:18 AM, Peter Ruch
>>     <rutschifengga@gmail.com> wrote:
>>
>>         Hello,
>>
>>         sum and threshold are both Integers.
>>         for the threshold variable I first add a new resource to the
>>         configuration - conf.addResource( ... );
>>
>>         later I get the threshold value from the configuration.
>>
>>         Code
>>         #####################################
>>
>>         private int threshold;
>>
>>         public void setup( Context context ) {
>>
>>                   Configuration conf = context.getConfiguration();
>>                   threshold = conf.getInt( "threshold", -1 );
>>
>>         }
>>
>>         #####################################
>>
>>
>>         Best,
>>         Peter
>>
>>
>>
>>         On 11.05.2015 19:26, Shahab Yunus wrote:
>>>         What is the type of the threshold variable? sum I believe is
>>>         a Java int.
>>>
>>>         Regards,
>>>         Shahab
>>>
>>>         On Mon, May 11, 2015 at 1:08 PM, Peter Ruch
>>>         <rutschifengga@gmail.com>
>>>         wrote:
>>>
>>>             Hi,
>>>
>>>             I am currently playing around with Hadoop and have some
>>>             problems when trying to filter in the Reducer.
>>>
>>>             I extended the WordCount v1.0 example from the 2.7
>>>             MapReduce Tutorial with some additional functionality
>>>             and added the possibility to filter by the specific
>>>             value of each key - e.g. only output the key-value pairs
>>>             where [[ value > threshold ]].
>>>
>>>             Filtering Code in Reducer
>>>             #####################################
>>>
>>>             for (IntWritable val : values) {
>>>                  sum += val.get();
>>>             }
>>>             if ( sum > threshold ) {
>>>                  result.set(sum);
>>>                  context.write(key, result);
>>>             }
>>>
>>>             #####################################
>>>
>>>             For a threshold smaller than any value, the above code works as
>>>             expected and the output contains all key-value pairs.
>>>             If I increase the threshold to 1 some pairs are missing
>>>             in the output although the respective value would be
>>>             larger than the threshold.
>>>
>>>             I tried to work out the error myself, but I could not
>>>             get it to work as intended. I use the exact Tutorial
>>>             setup with Oracle JDK 8
>>>             on a CentOS 7 machine.
>>>
>>>             As far as I understand the respective Iterable<...>  in
>>>             the Reducer already contains all the observed values for
>>>             a specific key.
>>>             Why is it possible that I am missing some of these
>>>             key-value pairs then? It only fails in very few cases.
>>>             The input file is pretty large - 250 MB -
>>>             so I also tried to increase the memory for the mapping
>>>             and reduction steps but it did not help ( tried a lot of
>>>             different stuff without success )
>>>
>>>             Maybe someone already experienced similar problems / is
>>>             more experienced than I am.
>>>
>>>
>>>             Thank you,
>>>
>>>             Peter
>>>
>>>
>>
>>
>

Re: Re: Re: Filtering by value in Reducer

Posted by Shahab Yunus <sh...@gmail.com>.

Have you tried explicitly printing or logging in your reducer around the
code that compares and then outputs the values? Maybe that will give you a
clue as to what is happening. Debug the threshold value that you get in the
reducer and check whether it is what you have set (in the case where you
set it to greater than -1).

You can also try to use the compare method for comparing IntWritables though I
doubt that would make any difference.
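
A minimal sketch of the kind of logging suggested above, written as plain
Java so that it runs without a cluster. The names ReducerLoggingSketch and
reduceWithLog are made up for illustration and are not Hadoop API; in a real
Reducer the same message could be printed to System.err or sent to a logger,
and it should then show up in the reduce task logs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reducer in this thread; not Hadoop API.
class ReducerLoggingSketch {

    // Sums the values for one key and returns the sum if it passes the
    // filter, or null if the pair would be dropped. Every decision is
    // logged together with the threshold that was actually in effect.
    static Integer reduceWithLog(String key, List<Integer> values, int threshold) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        boolean emit = sum > threshold;
        System.err.println("key=" + key + " values=" + values.size()
                + " sum=" + sum + " threshold=" + threshold + " emit=" + emit);
        return emit ? sum : null;
    }

    public static void main(String[] args) {
        reduceWithLog("hadoop", Arrays.asList(1, 1, 1), 1); // passes the filter
        reduceWithLog("rare", Arrays.asList(1), 1);         // is filtered out
    }
}
```

Logging the number of values per key next to the sum makes it easier to see
whether a key arrived with fewer values than expected or whether the
threshold differs from the configured one.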

Shahab
On May 12, 2015 8:17 AM, "Peter Ruch" <ru...@gmail.com> wrote:

>  Hi,
>
> I already skimmed through the logs but I could not find anything special.
>
> I am just really confused why I am having this problem.
>
> If the Iterable<...> for a specific key contains all of the observed
> values - and it seems to do so
> otherwise the program wouldn't work correctly in the standard case with [[
> threshold = -1 ]] -
> it should also work when I only write the key-value pairs to the output
> file that satisfy the condition [[ sum > threshold ]].
>
> Did I miss something? Maybe I have to handle these cases in a specific
> way, but I did not find anything about that online.
>
>
> Thank you for your help,
>
> Peter
>
>
>
> On 12.05.2015 12:35, Drake민영근 wrote:
>
> Hi, Peter
>
>  The missing records, are they just gone without any logs? How about your
> reduce task logs?
>
>  Thanks
>
>   Drake 민영근 Ph.D
> kt NexR
>
> On Tue, May 12, 2015 at 5:18 AM, Peter Ruch <ru...@gmail.com>
> wrote:
>
>>  Hello,
>>
>> sum and threshold are both Integers.
>> for the threshold variable I first add a new resource to the
>> configuration - conf.addResource( ... );
>>
>> later I get the threshold value from the configuration.
>>
>> Code
>> #####################################
>>
>> private int threshold;
>>
>> public void setup( Context context ) {
>>
>>           Configuration conf = context.getConfiguration();
>>           threshold = conf.getInt( "threshold", -1 );
>>
>> }
>>
>> #####################################
>>
>>
>> Best,
>> Peter
>>
>>
>>
>> On 11.05.2015 19:26, Shahab Yunus wrote:
>>
>> What is the type of the threshold variable? sum I believe is a Java int.
>>
>>  Regards,
>> Shahab
>>
>> On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <ru...@gmail.com>
>> wrote:
>>
>>>   Hi,
>>>
>>>  I am currently playing around with Hadoop and have some problems when
>>> trying to filter in the Reducer.
>>>
>>> I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial
>>> with some additional functionality
>>> and added the possibility to filter by the specific value of each key -
>>> e.g. only output the key-value pairs where [[ value > threshold ]].
>>>
>>>  Filtering Code in Reducer
>>>  #####################################
>>>
>>>  for (IntWritable val : values) {
>>>      sum += val.get();
>>> }
>>> if ( sum > threshold ) {
>>>      result.set(sum);
>>>      context.write(key, result);
>>> }
>>>
>>> #####################################
>>>
>>>  For a threshold smaller than any value, the above code works as expected and
>>> the output contains all key-value pairs.
>>>  If I increase the threshold to 1 some pairs are missing in the output
>>> although the respective value would be larger than the threshold.
>>>
>>>  I tried to work out the error myself, but I could not get it to work
>>> as intended. I use the exact Tutorial setup with Oracle JDK 8
>>>  on a CentOS 7 machine.
>>>
>>>  As far as I understand the respective Iterable<...>  in the Reducer
>>> already contains all the observed values for a specific key.
>>>  Why is it possible that I am missing some of these key-value pairs
>>> then? It only fails in very few cases. The input file is pretty large - 250
>>> MB -
>>>  so I also tried to increase the memory for the mapping and reduction
>>> steps but it did not help ( tried a lot of different stuff without success )
>>>
>>>  Maybe someone already experienced similar problems / is more
>>> experienced than I am.
>>>
>>>
>>>  Thank you,
>>>
>>>  Peter
>>>
>>
>>
>>
>
>

Re: Re: Re: Filtering by value in Reducer

Posted by Peter Ruch <ru...@gmail.com>.

Hi,

I already skimmed through the logs but I could not find anything special.

I am just really confused why I am having this problem.

If the Iterable<...> for a specific key contains all of the observed 
values - and it seems to do so
otherwise the program wouldn't work correctly in the standard case with 
[[ threshold = -1 ]] -
it should also work when I only write the key-value pairs to the output 
file that satisfy the condition [[ sum > threshold ]].

Did I miss something? Maybe I have to handle these cases in a specific 
way, but I did not find anything about that online.
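
One possible cause worth ruling out, which nobody in this thread confirms:
the WordCount v1.0 driver in the tutorial also registers the reducer class
as the combiner (job.setCombinerClass(IntSumReducer.class)). If the modified
job still does that, the threshold test would also run on the partial sums
of each map task, so a word whose partial counts are each at or below the
threshold would be dropped before the reducer ever sees it. The plain-Java
simulation below only illustrates that effect under this assumption;
CombinerFilterDemo, sumAndFilter and run are hypothetical names, and nothing
in it is Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation; none of these names are Hadoop API.
class CombinerFilterDemo {

    // The reduce logic from this thread: sum the values and emit the
    // pair only if sum > threshold.
    static void sumAndFilter(String key, List<Integer> values, int threshold,
                             Map<String, List<Integer>> out) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        if (sum > threshold) {
            out.computeIfAbsent(key, k -> new ArrayList<>()).add(sum);
        }
    }

    // Each inner list holds the words seen by one map task.
    static Map<String, Integer> run(List<List<String>> splits, int threshold,
                                    boolean filterInCombiner) {
        // A combiner that only sums is modelled by a threshold that
        // every positive count passes.
        int combinerThreshold = filterInCombiner ? threshold : Integer.MIN_VALUE;
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (List<String> split : splits) {
            // Map step: emit (word, 1) for every word in this split.
            Map<String, List<Integer>> mapOutput = new TreeMap<>();
            for (String word : split) {
                mapOutput.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
            // Combine step: runs on the output of this one map task only.
            for (Map.Entry<String, List<Integer>> e : mapOutput.entrySet()) {
                sumAndFilter(e.getKey(), e.getValue(), combinerThreshold, shuffled);
            }
        }
        // Reduce step: sees only the partial sums that survived combining.
        Map<String, List<Integer>> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            sumAndFilter(e.getKey(), e.getValue(), threshold, reduced);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : reduced.entrySet()) {
            result.put(e.getKey(), e.getValue().get(0));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("b", "c"));
        System.out.println(run(splits, 1, false)); // {a=2, b=2}
        System.out.println(run(splits, 1, true));  // {a=2}, "b" is lost
    }
}
```

If this turns out to be what is happening, removing the setCombinerClass
call, or using a combiner that only sums and never filters, should bring the
missing pairs back.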


Thank you for your help,

Peter



On 12.05.2015 12:35, Drake민영근 wrote:
> Hi, Peter
>
> The missing records, are they just gone without any logs? How about
> your reduce task logs?
>
> Thanks
>
> Drake 민영근 Ph.D
> kt NexR
>
> On Tue, May 12, 2015 at 5:18 AM, Peter Ruch
> <rutschifengga@gmail.com> wrote:
>
>     Hello,
>
>     sum and threshold are both Integers.
>     for the threshold variable I first add a new resource to the
>     configuration - conf.addResource( ... );
>
>     later I get the threshold value from the configuration.
>
>     Code
>     #####################################
>
>     private int threshold;
>
>     public void setup( Context context ) {
>
>               Configuration conf = context.getConfiguration();
>               threshold = conf.getInt( "threshold", -1 );
>
>     }
>
>     #####################################
>
>
>     Best,
>     Peter
>
>
>
>     On 11.05.2015 19:26, Shahab Yunus wrote:
>>     What is the type of the threshold variable? sum I believe is a
>>     Java int.
>>
>>     Regards,
>>     Shahab
>>
>>     On Mon, May 11, 2015 at 1:08 PM, Peter Ruch
>>     <rutschifengga@gmail.com> wrote:
>>
>>         Hi,
>>
>>         I am currently playing around with Hadoop and have some
>>         problems when trying to filter in the Reducer.
>>
>>         I extended the WordCount v1.0 example from the 2.7 MapReduce
>>         Tutorial with some additional functionality
>>         and added the possibility to filter by the specific value of
>>         each key - e.g. only output the key-value pairs where [[
>>         value > threshold ]].
>>
>>         Filtering Code in Reducer
>>         #####################################
>>
>>         for (IntWritable val : values) {
>>              sum += val.get();
>>         }
>>         if ( sum > threshold ) {
>>              result.set(sum);
>>              context.write(key, result);
>>         }
>>
>>         #####################################
>>
>>         For a threshold smaller than any value, the above code works as
>>         expected and the output contains all key-value pairs.
>>         If I increase the threshold to 1 some pairs are missing in
>>         the output although the respective value would be larger than
>>         the threshold.
>>
>>         I tried to work out the error myself, but I could not get it
>>         to work as intended. I use the exact Tutorial setup with
>>         Oracle JDK 8
>>         on a CentOS 7 machine.
>>
>>         As far as I understand the respective Iterable<...>  in the
>>         Reducer already contains all the observed values for a
>>         specific key.
>>         Why is it possible that I am missing some of these key-value
>>         pairs then? It only fails in very few cases. The input file
>>         is pretty large - 250 MB -
>>         so I also tried to increase the memory for the mapping and
>>         reduction steps but it did not help ( tried a lot of
>>         different stuff without success )
>>
>>         Maybe someone already experienced similar problems / is more
>>         experienced than I am.
>>
>>
>>         Thank you,
>>
>>         Peter
>>
>>
>
>

>>         the threshold.
>>
>>         I tried to work out the error myself, but I could not get it
>>         to work as intended. I use the exact Tutorial setup with
>>         Oracle JDK 8
>>         on a CentOS 7 machine.
>>
>>         As far as I understand the respective Iterable<...>  in the
>>         Reducer already contains all the observed values for a
>>         specific key.
>>         Why is it possible that I am missing some of these key-value
>>         pairs then? It only fails in very few cases. The input file
>>         is pretty large - 250 MB -
>>         so I also tried to increase the memory for the mapping and
>>         reduction steps but it did not help ( tried a lot of
>>         different stuff without success )
>>
>>         Maybe someone already experienced similar problems / is more
>>         experienced than I am.
>>
>>
>>         Thank you,
>>
>>         Peter
>>
>>
>
>

Re: Re: Filtering by value in Reducer

Posted by Drake민영근 <dr...@nexr.com>.

Hi, Peter

Are the missing records simply gone, with nothing in the logs? What do your
reduce task logs show?

Thanks

Drake 민영근 Ph.D
kt NexR

On Tue, May 12, 2015 at 5:18 AM, Peter Ruch <ru...@gmail.com> wrote:

>  Hello,
>
> sum and threshold are both Integers.
> for the threshold variable I first add a new resource to the configuration
> - conf.addResource( ... );
>
> later I get the threshold value from the configuration.
>
> Code
> #####################################
>
> private int threshold;
>
> public void setup( Context context ) {
>
>           Configuration conf = context.getConfiguration();
>           threshold = conf.getInt( "threshold", -1 );
>
> }
>
> #####################################
>
>
> Best,
> Peter
>
>
>
> On 11.05.2015 19:26, Shahab Yunus wrote:
>
> What is the type of the threshold variable? sum I believe is a Java int.
>
>  Regards,
> Shahab
>
> On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <ru...@gmail.com>
> wrote:
>
>>   Hi,
>>
>>  I am currently playing around with Hadoop and have some problems when
>> trying to filter in the Reducer.
>>
>> I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial
>> with some additional functionality
>> and added the possibility to filter by the specific value of each key -
>> e.g. only output the key-value pairs where [[ value > threshold ]].
>>
>>  Filtering Code in Reducer
>>  #####################################
>>
>>  for (IntWritable val : values) {
>>      sum += val.get();
>> }
>> if ( sum > threshold ) {
>>      result.set(sum);
>>      context.write(key, result);
>> }
>>
>> #####################################
>>
>>  For threshold smaller any value the above code works as expected and
>> the output contains all key-value pairs.
>>  If I increase the threshold to 1 some pairs are missing in the output
>> although the respective value would be larger than the threshold.
>>
>>  I tried to work out the error myself, but I could not get it to work as
>> intended. I use the exact Tutorial setup with Oracle JDK 8
>>  on a CentOS 7 machine.
>>
>>  As far as I understand the respective Iterable<...>  in the Reducer
>> already contains all the observed values for a specific key.
>>  Why is it possible that I am missing some of these key-value pairs
>> then? It only fails in very few cases. The input file is pretty large - 250
>> MB -
>>  so I also tried to increase the memory for the mapping and reduction
>> steps but it did not help ( tried a lot of different stuff without success )
>>
>>  Maybe someone already experienced similar problems / is more
>> experienced than I am.
>>
>>
>>  Thank you,
>>
>>  Peter
>>
>
>
>

Re: Re: Filtering by value in Reducer

Posted by Peter Ruch <ru...@gmail.com>.

Hello,

sum and threshold are both integers.
For the threshold variable, I first add a new resource to the
configuration with conf.addResource( ... ), and later read the
threshold value from the configuration.

Code
#####################################

private int threshold;

public void setup( Context context ) {

           Configuration conf = context.getConfiguration();
           threshold = conf.getInt( "threshold", -1 );

}

#####################################
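
A small sketch of what the lookup above does when the key is absent. It uses
java.util.Properties as a stand-in for Hadoop's Configuration (an assumption
made only so the example runs without Hadoop on the classpath; the helper
getInt below is hypothetical and merely mirrors the shape of
conf.getInt( "threshold", -1 )). The point is that a missing "threshold"
property silently yields the default -1, so it can be worth printing the
effective value once in setup() to confirm that the resource actually reached
the task.

```java
import java.util.Properties;

public class ThresholdLookupDemo {

    // Hypothetical stand-in for Configuration.getInt(name, defaultValue):
    // returns the default when the property is not set.
    static int getInt(Properties props, String name, int defaultValue) {
        String raw = props.getProperty(name);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties withoutKey = new Properties();
        Properties withKey = new Properties();
        withKey.setProperty("threshold", "1");

        // Missing key: falls back to -1, so every sum passes sum > threshold.
        System.out.println("effective threshold = " + getInt(withoutKey, "threshold", -1));
        // Key present: the configured value is used.
        System.out.println("effective threshold = " + getInt(withKey, "threshold", -1));
    }
}
```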


Best,
Peter


On 11.05.2015 19:26, Shahab Yunus wrote:
> What is the type of the threshold variable? sum I believe is a Java int.
>
> Regards,
> Shahab
>
> On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <rutschifengga@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Hi,
>
>     I am currently playing around with Hadoop and have some problems
>     when trying to filter in the Reducer.
>
>     I extended the WordCount v1.0 example from the 2.7 MapReduce
>     Tutorial with some additional functionality
>     and added the possibility to filter by the specific value of each
>     key - e.g. only output the key-value pairs where [[ value >
>     threshold ]].
>
>     Filtering Code in Reducer
>     #####################################
>
>     for (IntWritable val : values) {
>          sum += val.get();
>     }
>     if ( sum > threshold ) {
>          result.set(sum);
>          context.write(key, result);
>     }
>
>     #####################################
>
>     For threshold smaller any value the above code works as expected
>     and the output contains all key-value pairs.
>     If I increase the threshold to 1 some pairs are missing in the
>     output although the respective value would be larger than the
>     threshold.
>
>     I tried to work out the error myself, but I could not get it to
>     work as intended. I use the exact Tutorial setup with Oracle JDK 8
>     on a CentOS 7 machine.
>
>     As far as I understand the respective Iterable<...>  in the
>     Reducer already contains all the observed values for a specific key.
>     Why is it possible that I am missing some of these key-value pairs
>     then? It only fails in very few cases. The input file is pretty
>     large - 250 MB -
>     so I also tried to increase the memory for the mapping and
>     reduction steps but it did not help ( tried a lot of different
>     stuff without success )
>
>     Maybe someone already experienced similar problems / is more
>     experienced than I am.
>
>
>     Thank you,
>
>     Peter
>
>

Re: Filtering by value in Reducer

Posted by Shahab Yunus <sh...@gmail.com>.

What is the type of the threshold variable? sum I believe is a Java int.

Regards,
Shahab

On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <ru...@gmail.com> wrote:

> Hi,
>
> I am currently playing around with Hadoop and have some problems when
> trying to filter in the Reducer.
>
> I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
> some additional functionality
> and added the possibility to filter by the specific value of each key -
> e.g. only output the key-value pairs where [[ value > threshold ]].
>
> Filtering Code in Reducer
> #####################################
>
> for (IntWritable val : values) {
>      sum += val.get();
> }
> if ( sum > threshold ) {
>      result.set(sum);
>      context.write(key, result);
> }
>
> #####################################
>
> For a threshold smaller than any value, the above code works as expected
> and the output contains all key-value pairs.
> If I increase the threshold to 1 some pairs are missing in the output
> although the respective value would be larger than the threshold.
>
> I tried to work out the error myself, but I could not get it to work as
> intended. I use the exact Tutorial setup with Oracle JDK 8
> on a CentOS 7 machine.
>
> As far as I understand the respective Iterable<...>  in the Reducer
> already contains all the observed values for a specific key.
> Why is it possible that I am missing some of these key-value pairs then?
> It only fails in very few cases. The input file is pretty large - 250 MB -
> so I also tried to increase the memory for the mapping and reduction steps
> but it did not help ( tried a lot of different stuff without success )
>
> Maybe someone already experienced similar problems / is more experienced
> than I am.
>
>
> Thank you,
>
> Peter
>
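
One possible cause of the missing pairs, which is not confirmed anywhere in
this thread: the WordCount v1.0 tutorial driver registers the reducer class as
the combiner as well (job.setCombinerClass(IntSumReducer.class)). If the
extended job kept that line, the threshold would also be applied to the partial
sums of each map task, before the totals are known. The sketch below simulates
that effect in plain Java without any Hadoop classes. The class name
ThresholdFilterSketch, its methods, and the sample splits are all hypothetical
and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map, combine, and reduce; no Hadoop classes are used.
// Every name in this class is hypothetical.
class ThresholdFilterSketch {

    // Same logic as the quoted reducer: sum the values of each key and
    // keep the pair only if the sum exceeds the threshold.
    static Map<String, Integer> sumAndFilter(Map<String, List<Integer>> grouped, int threshold) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int val : entry.getValue()) {
                sum += val;
            }
            if (sum > threshold) {
                out.put(entry.getKey(), sum);
            }
        }
        return out;
    }

    // Appends one value to the list stored under a key.
    static void add(Map<String, List<Integer>> grouped, String key, int value) {
        grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Runs the simulated job. Each inner array stands for one input split.
    static Map<String, Integer> run(String[][] splits, int threshold, boolean filterInCombiner) {
        Map<String, List<Integer>> reducerInput = new TreeMap<>();
        for (String[] split : splits) {
            // Map phase: emit (word, 1) for every word in this split.
            Map<String, List<Integer>> mapOutput = new TreeMap<>();
            for (String word : split) {
                add(mapOutput, word, 1);
            }
            // Combine phase: a sum-only combiner keeps every partial sum; the
            // suspected misconfiguration also filters the partial sums.
            int combinerThreshold = filterInCombiner ? threshold : Integer.MIN_VALUE;
            Map<String, Integer> combined = sumAndFilter(mapOutput, combinerThreshold);
            // Shuffle: group the surviving partial sums by key for the reducer.
            for (Map.Entry<String, Integer> entry : combined.entrySet()) {
                add(reducerInput, entry.getKey(), entry.getValue());
            }
        }
        // Reduce phase: final sum and threshold filter.
        return sumAndFilter(reducerInput, threshold);
    }

    public static void main(String[] args) {
        String[][] splits = {{"a", "a", "b"}, {"a", "b", "c"}};
        System.out.println(run(splits, 1, false)); // {a=3, b=2}
        System.out.println(run(splits, 1, true));  // {a=2}
    }
}
```

In this simulation a threshold of 0 gives the same output either way, while a
threshold of 1 drops "b" and undercounts "a" once the filter also runs in the
combine step, which resembles the reported symptom. If this is indeed the
cause, removing the setCombinerClass call or registering a separate sum-only
combiner class might be worth trying.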
