Posted to user@hbase.apache.org by Adam Phelps <am...@opendns.com> on 2010/11/05 01:57:40 UTC

Duplicated entries with map job reading from HBase

I've noticed an odd behavior with a map-reduce job I've written which is 
reading data out of an HBase table.  After a couple days of poking at 
this I haven't been able to figure out the cause of the problem, so I 
figured I'd ask on here.

(For reference I'm running with the cdh3b2 release)

The problem is that every line from the HBase table seems to be passed
to the mappers twice, resulting in counts that are exactly double what
they should be.

I set up the job like this:

             Scan scan = new Scan();
             scan.addFamily(Bytes.toBytes(scanFamily));

             TableMapReduceUtil.initTableMapperJob(table,
                                                   scan,
                                                   mapper,
                                                   Text.class,
                                                   LongWritable.class,
                                                   job);
             job.setCombinerClass(LongSumReducer.class);

             job.setReducerClass(reducer);

I've set up counters in the mapper to verify what is happening, so that 
I know for certain that the mapper is being called twice with the same 
bit of data.  I've also confirmed (using the hbase shell) that each 
entry appears only once in the table.
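
For reference, a counter-instrumented mapper along those lines might
look like the following sketch.  The class name, counter names, and key
handling here are illustrative (matching the Text/LongWritable setup
above), not the actual job:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class CountingMapper extends TableMapper<Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(ImmutableBytesWritable row, Result value,
                         Context context)
          throws IOException, InterruptedException {
        // If every row is delivered twice, this counter ends up at
        // exactly 2x the table's row count in the job's counter output.
        context.getCounter("debug", "ROWS_SEEN").increment(1);
        context.write(new Text(Bytes.toString(row.get())), ONE);
      }
    }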

Is there a known bug along these lines?  If not, does anyone have any 
thoughts on what might be causing this or where I'd start looking to 
diagnose?

Thanks
- Adam

Re: Duplicated entries with map job reading from HBase

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

The only thing I can think of right now is that perhaps you are seeing
something similar to an effect I encountered a while ago, which is
described here:
https://issues.apache.org/jira/browse/MAPREDUCE-2094

This _should_ only occur when reading regular files. But perhaps you are
experiencing something similar.

Niels


2010/11/6 Adam Phelps <am...@opendns.com>

> Yeah, it wasn't the combiner.  The repeated entries are actually seen by
> the mapper, so before the combiner comes into play.  Is there some other
> info that would be useful in getting clues as to what is causing this?
>
> - Adam
>
>
> On 11/5/10 11:35 AM, Adam Phelps wrote:
>
>> No, the system actually is much larger than two nodes. But the number of
>> mappers used here tends to be fairly small (I suspect based on the HBase
>> regions being accessed but usually more than two), I'll try turning off
>> the combiner to see if that changes anything.
>>
>> Thanks
>> - Adam
>>
>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>
>>> Hi,
>>>
>>> I don't know the answer (simply not enough information in your email)
>>> but I'm willing to make a guess:
>>> You are running on a system with two processing nodes?
>>> If so then try removing the Combiner. The combiner is a performance
>>> optimization and the whole processing should work without it.
>>> Sometimes there is a design fault in the processing and the combiner
>>> disrupts the processing.
>>>
>>> HTH
>>>
>>> Niels Basjes
>>>
>>> 2010/11/5 Adam Phelps <amp@opendns.com <ma...@opendns.com>>
>>>
>>> I've noticed an odd behavior with a map-reduce job I've written
>>> which is reading data out of an HBase table. After a couple days of
>>> poking at this I haven't been able to figure out the cause of the
>>> problem, so I figured I'd ask on here.
>>>
>>> (For reference I'm running with the cdh3b2 release)
>>>
>>> The problem is that it seems that every line from the HBase table is
>>> passed to the mappers twice, thus resulting in counts ending up as
>>> exactly double what they should be.
>>>
>>> I set up the job like this:
>>>
>>> Scan scan = new Scan();
>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>
>>> TableMapReduceUtil.initTableMapperJob(table,
>>> scan,
>>> mapper,
>>> Text.class,
>>> LongWritable.class,
>>> job);
>>> job.setCombinerClass(LongSumReducer.class);
>>>
>>> job.setReducerClass(reducer);
>>>
>>> I've set up counters in the mapper to verify what is happening, so
>>> that I know for certain that the mapper is being called twice with
>>> the same bit of data. I've also confirmed (using the hbase shell)
>>> that each entry appears only once in the table.
>>>
>>> Is there a known bug along these lines? If not, does anyone have
>>> any thoughts on what might be causing this or where I'd start
>>> looking to diagnose?
>>>
>>> Thanks
>>> - Adam
>>>
>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Niels Basjes
>>>
>>
>>
>


-- 
Kind regards,

Niels Basjes

Re: Duplicated entries with map job reading from HBase

Posted by Adam Phelps <am...@opendns.com>.
I just tried that experimentally since I don't have any better ideas,
but it didn't change anything.  My only current speculation is that the
older table somehow has something messed up in its metadata, so I'm
just going to try copying its data to a new table and go from there.
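
A copy job along those lines could be sketched as a small MapReduce
pass.  This is illustrative only -- the table names are placeholders,
`conf` is assumed to be the job's Configuration, and the Result-to-Put
conversion is a sketch, not the thread's actual code -- but it carries
each cell over verbatim, so timestamps survive:

    // Sketch: copy "oldtable" into "newtable" cell-by-cell.
    Scan scan = new Scan();
    Job copy = new Job(conf, "copy oldtable to newtable");
    TableMapReduceUtil.initTableMapperJob("oldtable", scan, CopyMapper.class,
                                          ImmutableBytesWritable.class,
                                          Put.class, copy);
    TableMapReduceUtil.initTableReducerJob("newtable",
                                           IdentityTableReducer.class, copy);

    // CopyMapper turns each scanned Result back into a Put:
    static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value,
                         Context context)
          throws IOException, InterruptedException {
        Put put = new Put(row.get());
        for (KeyValue kv : value.raw()) {
          put.add(kv);   // copy the cell verbatim, timestamp included
        }
        context.write(row, put);
      }
    }

(Newer HBase releases also ship an
org.apache.hadoop.hbase.mapreduce.CopyTable utility that does
essentially this.)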

- Adam

On 11/8/10 5:19 PM, Buttler, David wrote:
> Could it be speculative execution?  You might want to ensure that is turned off.
> Dave
>
> -----Original Message-----
> From: Adam Phelps [mailto:amp@opendns.com]
> Sent: Monday, November 08, 2010 4:30 PM
> To: mapreduce-user@hadoop.apache.org; user@hbase.apache.org
> Subject: Re: Duplicated entries with map job reading from HBase
>
> Ok, poked around at this a little more with a few experiments.
>
> The most interesting one is that I ran a couple of the jobs that
> generate this data in HBase, one for the existing table I had seen the
> problem on and one for a new table with the same configuration as the
> old one.
>
> When the analysis job is run reading from HBase the counts are only
> doubled against the older table, using the new table as input produces
> the correct results.
>
> When doing this I also noticed that when using the new table only a
> single mapper is created, however for the old table two mappers are
> created (I checked and the data comes from only a single region in
> either case).
>
> So something is causing each hbase entry to be passed to a mapper twice
> on the older table, but only once on the newer table.
>
> Anyone have further thoughts on this?  I'm basically at the end of my
> ideas on figuring this out.
>
> - Adam
>
> On 11/5/10 4:01 PM, Adam Phelps wrote:
>> Yeah, it wasn't the combiner. The repeated entries are actually seen by
>> the mapper, so before the combiner comes into play. Is there some other
>> info that would be useful in getting clues as to what is causing this?
>>
>> - Adam
>>
>> On 11/5/10 11:35 AM, Adam Phelps wrote:
>>> No, the system actually is much larger than two nodes. But the number of
>>> mappers used here tends to be fairly small (I suspect based on the HBase
>>> regions being accessed but usually more than two), I'll try turning off
>>> the combiner to see if that changes anything.
>>>
>>> Thanks
>>> - Adam
>>>
>>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>>> Hi,
>>>>
>>>> I don't know the answer (simply not enough information in your email)
>>>> but I'm willing to make a guess:
>>>> You are running on a system with two processing nodes?
>>>> If so then try removing the Combiner. The combiner is a performance
>>>> optimization and the whole processing should work without it.
>>>> Sometimes there is a design fault in the processing and the combiner
>>>> disrupts the processing.
>>>>
>>>> HTH
>>>>
>>>> Niels Basjes
>>>>
>>>> 2010/11/5 Adam Phelps<am...@opendns.com>>
>>>>
>>>> I've noticed an odd behavior with a map-reduce job I've written
>>>> which is reading data out of an HBase table. After a couple days of
>>>> poking at this I haven't been able to figure out the cause of the
>>>> problem, so I figured I'd ask on here.
>>>>
>>>> (For reference I'm running with the cdh3b2 release)
>>>>
>>>> The problem is that it seems that every line from the HBase table is
>>>> passed to the mappers twice, thus resulting in counts ending up as
>>>> exactly double what they should be.
>>>>
>>>> I set up the job like this:
>>>>
>>>> Scan scan = new Scan();
>>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>>
>>>> TableMapReduceUtil.initTableMapperJob(table,
>>>> scan,
>>>> mapper,
>>>> Text.class,
>>>> LongWritable.class,
>>>> job);
>>>> job.setCombinerClass(LongSumReducer.class);
>>>>
>>>> job.setReducerClass(reducer);
>>>>
>>>> I've set up counters in the mapper to verify what is happening, so
>>>> that I know for certain that the mapper is being called twice with
>>>> the same bit of data. I've also confirmed (using the hbase shell)
>>>> that each entry appears only once in the table.
>>>>
>>>> Is there a known bug along these lines? If not, does anyone have
>>>> any thoughts on what might be causing this or where I'd start
>>>> looking to diagnose?
>>>>
>>>> Thanks
>>>> - Adam
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Kind regards,
>>>>
>>>> Niels Basjes
>>>
>>
>


RE: Duplicated entries with map job reading from HBase

Posted by "Buttler, David" <bu...@llnl.gov>.
Could it be speculative execution?  You might want to ensure that is turned off.
Dave
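
For that test, speculative execution can be switched off per job; a
minimal sketch, assuming the Hadoop 0.20-era property names used by
cdh3 and the Job object from the original setup:

    // Sketch: stop the framework from launching duplicate (speculative)
    // attempts of slow tasks for this job.
    Configuration conf = job.getConfiguration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);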

-----Original Message-----
From: Adam Phelps [mailto:amp@opendns.com] 
Sent: Monday, November 08, 2010 4:30 PM
To: mapreduce-user@hadoop.apache.org; user@hbase.apache.org
Subject: Re: Duplicated entries with map job reading from HBase

Ok, poked around at this a little more with a few experiments.

The most interesting one is that I ran a couple of the jobs that
generate this data in HBase, one for the existing table I had seen the 
problem on and one for a new table with the same configuration as the 
old one.

When the analysis job is run reading from HBase the counts are only 
doubled against the older table, using the new table as input produces 
the correct results.

When doing this I also noticed that when using the new table only a 
single mapper is created, however for the old table two mappers are 
created (I checked and the data comes from only a single region in 
either case).

So something is causing each hbase entry to be passed to a mapper twice 
on the older table, but only once on the newer table.

Anyone have further thoughts on this?  I'm basically at the end of my 
ideas on figuring this out.

- Adam

On 11/5/10 4:01 PM, Adam Phelps wrote:
> Yeah, it wasn't the combiner. The repeated entries are actually seen by
> the mapper, so before the combiner comes into play. Is there some other
> info that would be useful in getting clues as to what is causing this?
>
> - Adam
>
> On 11/5/10 11:35 AM, Adam Phelps wrote:
>> No, the system actually is much larger than two nodes. But the number of
>> mappers used here tends to be fairly small (I suspect based on the HBase
>> regions being accessed but usually more than two), I'll try turning off
>> the combiner to see if that changes anything.
>>
>> Thanks
>> - Adam
>>
>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>> Hi,
>>>
>>> I don't know the answer (simply not enough information in your email)
>>> but I'm willing to make a guess:
>>> You are running on a system with two processing nodes?
>>> If so then try removing the Combiner. The combiner is a performance
>>> optimization and the whole processing should work without it.
>>> Sometimes there is a design fault in the processing and the combiner
>>> disrupts the processing.
>>>
>>> HTH
>>>
>>> Niels Basjes
>>>
>>> 2010/11/5 Adam Phelps <amp@opendns.com <ma...@opendns.com>>
>>>
>>> I've noticed an odd behavior with a map-reduce job I've written
>>> which is reading data out of an HBase table. After a couple days of
>>> poking at this I haven't been able to figure out the cause of the
>>> problem, so I figured I'd ask on here.
>>>
>>> (For reference I'm running with the cdh3b2 release)
>>>
>>> The problem is that it seems that every line from the HBase table is
>>> passed to the mappers twice, thus resulting in counts ending up as
>>> exactly double what they should be.
>>>
>>> I set up the job like this:
>>>
>>> Scan scan = new Scan();
>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>
>>> TableMapReduceUtil.initTableMapperJob(table,
>>> scan,
>>> mapper,
>>> Text.class,
>>> LongWritable.class,
>>> job);
>>> job.setCombinerClass(LongSumReducer.class);
>>>
>>> job.setReducerClass(reducer);
>>>
>>> I've set up counters in the mapper to verify what is happening, so
>>> that I know for certain that the mapper is being called twice with
>>> the same bit of data. I've also confirmed (using the hbase shell)
>>> that each entry appears only once in the table.
>>>
>>> Is there a known bug along these lines? If not, does anyone have
>>> any thoughts on what might be causing this or where I'd start
>>> looking to diagnose?
>>>
>>> Thanks
>>> - Adam
>>>
>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Niels Basjes
>>
>


Re: Duplicated entries with map job reading from HBase

Posted by Adam Phelps <am...@opendns.com>.
That had been my initial thought; however, dumping the data from the
hbase shell found only single entries.  Furthermore, with the experiment
I ran yesterday (generating the data to a new table as well as the old
one) the entries being created should have been identical for each table.

The only thing I can think of here is that some metadata for the
original table has been messed up in such a way that it's being
double-processed as input for the mapper, but I have no idea where to
look for that.
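
One place to look would be the table's rows in the .META. catalog
table.  A rough sketch -- the table name is a placeholder, `conf` is
assumed in scope, and this is a guess at a diagnostic rather than an
established procedure:

    // Sketch: list the .META. rows for one table.  Row keys there have
    // the form "tablename,startkey,regionid", so duplicate or
    // overlapping region entries for the table would show up directly.
    HTable meta = new HTable(conf, ".META.");
    ResultScanner results = meta.getScanner(new Scan(Bytes.toBytes("oldtable,,")));
    for (Result r : results) {
      String row = Bytes.toString(r.getRow());
      if (!row.startsWith("oldtable,")) break;   // past this table's regions
      System.out.println(row);
    }
    results.close();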

- Adam

On 11/9/10 4:20 AM, Biedermann,S.,Fa. Post Direkt wrote:
> Hi Adam,
>
> Is it possible that you have double entries in your old table (two entries for the same (column family, column, timestamp) tuple)?
>
> Sven
>
> -----Original Message-----
> From: Adam Phelps [mailto:amp@opendns.com]
> Sent: Tuesday, November 9, 2010 01:30
> To: mapreduce-user@hadoop.apache.org; user@hbase.apache.org
> Subject: Re: Duplicated entries with map job reading from HBase
>
> Ok, poked around at this a little more with a few experiments.
>
> The most interesting one is that I ran a couple of the jobs that generate this data in HBase, one for the existing table I had seen the problem on and one for a new table with the same configuration as the old one.
>
> When the analysis job is run reading from HBase the counts are only doubled against the older table, using the new table as input produces the correct results.
>
> When doing this I also noticed that when using the new table only a single mapper is created, however for the old table two mappers are created (I checked and the data comes from only a single region in either case).
>
> So something is causing each hbase entry to be passed to a mapper twice on the older table, but only once on the newer table.
>
> Anyone have further thoughts on this?  I'm basically at the end of my ideas on figuring this out.
>
> - Adam
>
> On 11/5/10 4:01 PM, Adam Phelps wrote:
>> Yeah, it wasn't the combiner. The repeated entries are actually seen
>> by the mapper, so before the combiner comes into play. Is there some
>> other info that would be useful in getting clues as to what is causing this?
>>
>> - Adam
>>
>> On 11/5/10 11:35 AM, Adam Phelps wrote:
>>> No, the system actually is much larger than two nodes. But the number
>>> of mappers used here tends to be fairly small (I suspect based on the
>>> HBase regions being accessed but usually more than two), I'll try
>>> turning off the combiner to see if that changes anything.
>>>
>>> Thanks
>>> - Adam
>>>
>>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>>> Hi,
>>>>
>>>> I don't know the answer (simply not enough information in your
>>>> email) but I'm willing to make a guess:
>>>> You are running on a system with two processing nodes?
>>>> If so then try removing the Combiner. The combiner is a performance
>>>> optimization and the whole processing should work without it.
>>>> Sometimes there is a design fault in the processing and the
>>>> combiner disrupts the processing.
>>>>
>>>> HTH
>>>>
>>>> Niels Basjes
>>>>
>>>> 2010/11/5 Adam Phelps<am...@opendns.com>>
>>>>
>>>> I've noticed an odd behavior with a map-reduce job I've written
>>>> which is reading data out of an HBase table. After a couple days of
>>>> poking at this I haven't been able to figure out the cause of the
>>>> problem, so I figured I'd ask on here.
>>>>
>>>> (For reference I'm running with the cdh3b2 release)
>>>>
>>>> The problem is that it seems that every line from the HBase table is
>>>> passed to the mappers twice, thus resulting in counts ending up as
>>>> exactly double what they should be.
>>>>
>>>> I set up the job like this:
>>>>
>>>> Scan scan = new Scan();
>>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>>
>>>> TableMapReduceUtil.initTableMapperJob(table,
>>>> scan,
>>>> mapper,
>>>> Text.class,
>>>> LongWritable.class,
>>>> job);
>>>> job.setCombinerClass(LongSumReducer.class);
>>>>
>>>> job.setReducerClass(reducer);
>>>>
>>>> I've set up counters in the mapper to verify what is happening, so
>>>> that I know for certain that the mapper is being called twice with
>>>> the same bit of data. I've also confirmed (using the hbase shell)
>>>> that each entry appears only once in the table.
>>>>
>>>> Is there a known bug along these lines? If not, does anyone have any
>>>> thoughts on what might be causing this or where I'd start looking to
>>>> diagnose?
>>>>
>>>> Thanks
>>>> - Adam
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Kind regards,
>>>>
>>>> Niels Basjes
>>>
>>
>


Re: Duplicated entries with map job reading from HBase

Posted by "Biedermann,S.,Fa. Post Direkt" <S....@postdirekt.de>.
Hi Adam,

Is it possible that you have double entries in your old table (two entries for the same (column family, column, timestamp) tuple)?
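
One way to check would be a scan that retains all versions, which
exposes multiple cells stored under the same coordinates.  A sketch,
with placeholder table and family names and `conf` assumed in scope:

    // Sketch: return every stored version of every cell.  A row holding
    // two cells at the same (family, qualifier) -- whether at different
    // timestamps or the same one -- would print twice here.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("f"));
    scan.setMaxVersions();                 // all versions, not just the latest
    HTable table = new HTable(conf, "oldtable");
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      for (KeyValue kv : r.raw()) {
        System.out.println(kv);            // prints row/family/qualifier/timestamp
      }
    }
    scanner.close();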

Sven

-----Original Message-----
From: Adam Phelps [mailto:amp@opendns.com]
Sent: Tuesday, November 9, 2010 01:30
To: mapreduce-user@hadoop.apache.org; user@hbase.apache.org
Subject: Re: Duplicated entries with map job reading from HBase

Ok, poked around at this a little more with a few experiments.

The most interesting one is that I ran a couple of the jobs that generate this data in HBase, one for the existing table I had seen the problem on and one for a new table with the same configuration as the old one.

When the analysis job is run reading from HBase the counts are only doubled against the older table, using the new table as input produces the correct results.

When doing this I also noticed that when using the new table only a single mapper is created, however for the old table two mappers are created (I checked and the data comes from only a single region in either case).

So something is causing each hbase entry to be passed to a mapper twice on the older table, but only once on the newer table.

Anyone have further thoughts on this?  I'm basically at the end of my ideas on figuring this out.

- Adam

On 11/5/10 4:01 PM, Adam Phelps wrote:
> Yeah, it wasn't the combiner. The repeated entries are actually seen 
> by the mapper, so before the combiner comes into play. Is there some 
> other info that would be useful in getting clues as to what is causing this?
>
> - Adam
>
> On 11/5/10 11:35 AM, Adam Phelps wrote:
>> No, the system actually is much larger than two nodes. But the number 
>> of mappers used here tends to be fairly small (I suspect based on the 
>> HBase regions being accessed but usually more than two), I'll try 
>> turning off the combiner to see if that changes anything.
>>
>> Thanks
>> - Adam
>>
>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>> Hi,
>>>
>>> I don't know the answer (simply not enough information in your 
>>> email) but I'm willing to make a guess:
>>> You are running on a system with two processing nodes?
>>> If so then try removing the Combiner. The combiner is a performance 
>>> optimization and the whole processing should work without it.
>>> Sometimes there is a design fault in the processing and the
>>> combiner disrupts the processing.
>>>
>>> HTH
>>>
>>> Niels Basjes
>>>
>>> 2010/11/5 Adam Phelps <amp@opendns.com <ma...@opendns.com>>
>>>
>>> I've noticed an odd behavior with a map-reduce job I've written 
>>> which is reading data out of an HBase table. After a couple days of 
>>> poking at this I haven't been able to figure out the cause of the 
>>> problem, so I figured I'd ask on here.
>>>
>>> (For reference I'm running with the cdh3b2 release)
>>>
>>> The problem is that it seems that every line from the HBase table is 
>>> passed to the mappers twice, thus resulting in counts ending up as 
>>> exactly double what they should be.
>>>
>>> I set up the job like this:
>>>
>>> Scan scan = new Scan();
>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>
>>> TableMapReduceUtil.initTableMapperJob(table,
>>> scan,
>>> mapper,
>>> Text.class,
>>> LongWritable.class,
>>> job);
>>> job.setCombinerClass(LongSumReducer.class);
>>>
>>> job.setReducerClass(reducer);
>>>
>>> I've set up counters in the mapper to verify what is happening, so 
>>> that I know for certain that the mapper is being called twice with 
>>> the same bit of data. I've also confirmed (using the hbase shell) 
>>> that each entry appears only once in the table.
>>>
>>> Is there a known bug along these lines? If not, does anyone have any 
>>> thoughts on what might be causing this or where I'd start looking to 
>>> diagnose?
>>>
>>> Thanks
>>> - Adam
>>>
>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Niels Basjes
>>
>


Re: Duplicated entries with map job reading from HBase

Posted by Adam Phelps <am...@opendns.com>.
Ok, poked around at this a little more with a few experiments.

The most interesting one is that I ran a couple of the jobs that
generate this data in HBase, one for the existing table I had seen the 
problem on and one for a new table with the same configuration as the 
old one.

When the analysis job is run reading from HBase the counts are only 
doubled against the older table, using the new table as input produces 
the correct results.

When doing this I also noticed that when using the new table only a
single mapper is created, whereas for the old table two mappers are
created (I checked, and the data comes from only a single region in
either case).

So something is causing each hbase entry to be passed to a mapper twice 
on the older table, but only once on the newer table.
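
Since TableInputFormat produces one split (and therefore one mapper)
per region, one diagnostic is to list the regions the client actually
resolves for each table.  A sketch, with a placeholder table name and
`conf` assumed in scope:

    // Sketch: print every region the client sees for the table.  With a
    // single real region, two map tasks would suggest the client (and
    // TableInputFormat) is resolving an extra, stale region entry.
    HTable table = new HTable(conf, "oldtable");
    Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
    System.out.println("regions seen by client: " + regions.size());
    for (HRegionInfo info : regions.keySet()) {
      System.out.println(info.getRegionNameAsString()
          + "  start=" + Bytes.toStringBinary(info.getStartKey())
          + "  end=" + Bytes.toStringBinary(info.getEndKey()));
    }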

Anyone have further thoughts on this?  I'm basically at the end of my 
ideas on figuring this out.

- Adam

On 11/5/10 4:01 PM, Adam Phelps wrote:
> Yeah, it wasn't the combiner. The repeated entries are actually seen by
> the mapper, so before the combiner comes into play. Is there some other
> info that would be useful in getting clues as to what is causing this?
>
> - Adam
>
> On 11/5/10 11:35 AM, Adam Phelps wrote:
>> No, the system actually is much larger than two nodes. But the number of
>> mappers used here tends to be fairly small (I suspect based on the HBase
>> regions being accessed but usually more than two), I'll try turning off
>> the combiner to see if that changes anything.
>>
>> Thanks
>> - Adam
>>
>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>> Hi,
>>>
>>> I don't know the answer (simply not enough information in your email)
>>> but I'm willing to make a guess:
>>> You are running on a system with two processing nodes?
>>> If so then try removing the Combiner. The combiner is a performance
>>> optimization and the whole processing should work without it.
>>> Sometimes there is a design fault in the processing and the combiner
>>> disrupts the processing.
>>>
>>> HTH
>>>
>>> Niels Basjes
>>>
>>> 2010/11/5 Adam Phelps <amp@opendns.com <ma...@opendns.com>>
>>>
>>> I've noticed an odd behavior with a map-reduce job I've written
>>> which is reading data out of an HBase table. After a couple days of
>>> poking at this I haven't been able to figure out the cause of the
>>> problem, so I figured I'd ask on here.
>>>
>>> (For reference I'm running with the cdh3b2 release)
>>>
>>> The problem is that it seems that every line from the HBase table is
>>> passed to the mappers twice, thus resulting in counts ending up as
>>> exactly double what they should be.
>>>
>>> I set up the job like this:
>>>
>>> Scan scan = new Scan();
>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>
>>> TableMapReduceUtil.initTableMapperJob(table,
>>> scan,
>>> mapper,
>>> Text.class,
>>> LongWritable.class,
>>> job);
>>> job.setCombinerClass(LongSumReducer.class);
>>>
>>> job.setReducerClass(reducer);
>>>
>>> I've set up counters in the mapper to verify what is happening, so
>>> that I know for certain that the mapper is being called twice with
>>> the same bit of data. I've also confirmed (using the hbase shell)
>>> that each entry appears only once in the table.
>>>
>>> Is there a known bug along these lines? If not, does anyone have
>>> any thoughts on what might be causing this or where I'd start
>>> looking to diagnose?
>>>
>>> Thanks
>>> - Adam
>>>
>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Niels Basjes
>>
>


Re: Duplicated entries with map job reading from HBase

Posted by Adam Phelps <am...@opendns.com>.
Yeah, it wasn't the combiner.  The repeated entries are actually seen by
the mapper, so the duplication happens before the combiner comes into
play.  Is there some other info that would be useful in getting clues as
to what is causing this?

- Adam

On 11/5/10 11:35 AM, Adam Phelps wrote:
> No, the system actually is much larger than two nodes. But the number of
> mappers used here tends to be fairly small (I suspect based on the HBase
> regions being accessed but usually more than two), I'll try turning off
> the combiner to see if that changes anything.
>
> Thanks
> - Adam
>
> On 11/5/10 9:23 AM, Niels Basjes wrote:
>> Hi,
>>
>> I don't know the answer (simply not enough information in your email)
>> but I'm willing to make a guess:
>> You are running on a system with two processing nodes?
>> If so then try removing the Combiner. The combiner is a performance
>> optimization and the whole processing should work without it.
>> Sometimes there is a design fault in the processing and the combiner
>> disrupts the processing.
>>
>> HTH
>>
>> Niels Basjes
>>
>> 2010/11/5 Adam Phelps <amp@opendns.com <ma...@opendns.com>>
>>
>> I've noticed an odd behavior with a map-reduce job I've written
>> which is reading data out of an HBase table. After a couple days of
>> poking at this I haven't been able to figure out the cause of the
>> problem, so I figured I'd ask on here.
>>
>> (For reference I'm running with the cdh3b2 release)
>>
>> The problem is that it seems that every line from the HBase table is
>> passed to the mappers twice, thus resulting in counts ending up as
>> exactly double what they should be.
>>
>> I set up the job like this:
>>
>> Scan scan = new Scan();
>> scan.addFamily(Bytes.toBytes(scanFamily));
>>
>> TableMapReduceUtil.initTableMapperJob(table,
>> scan,
>> mapper,
>> Text.class,
>> LongWritable.class,
>> job);
>> job.setCombinerClass(LongSumReducer.class);
>>
>> job.setReducerClass(reducer);
>>
>> I've set up counters in the mapper to verify what is happening, so
>> that I know for certain that the mapper is being called twice with
>> the same bit of data. I've also confirmed (using the hbase shell)
>> that each entry appears only once in the table.
>>
>> Is there a known bug along these lines? If not, does anyone have
>> any thoughts on what might be causing this or where I'd start
>> looking to diagnose?
>>
>> Thanks
>> - Adam
>>
>>
>>
>>
>> --
>> Kind regards,
>>
>> Niels Basjes
>


Re: Duplicated entries with map job reading from HBase

Posted by Adam Phelps <am...@opendns.com>.
No, the system actually is much larger than two nodes.  But the number
of mappers used here tends to be fairly small (I suspect it's based on
the HBase regions being accessed, though usually more than two).  I'll
try turning off the combiner to see if that changes anything.

Thanks
- Adam

On 11/5/10 9:23 AM, Niels Basjes wrote:
> Hi,
>
> I don't know the answer (simply not enough information in your email)
> but I'm willing to make a guess:
> You are running on a system with two processing nodes?
> If so then try removing the Combiner. The combiner is a performance
> optimization and the whole processing should work without it.
> Sometimes there is a design fault in the processing and the combiner
> disrupts the processing.
>
> HTH
>
> Niels Basjes
>
> 2010/11/5 Adam Phelps <amp@opendns.com <ma...@opendns.com>>
>
>     I've noticed an odd behavior with a map-reduce job I've written
>     which is reading data out of an HBase table.  After a couple days of
>     poking at this I haven't been able to figure out the cause of the
>     problem, so I figured I'd ask on here.
>
>     (For reference I'm running with the cdh3b2 release)
>
>     The problem is that it seems that every line from the HBase table is
>     passed to the mappers twice, thus resulting in counts ending up as
>     exactly double what they should be.
>
>     I set up the job like this:
>
>                 Scan scan = new Scan();
>                 scan.addFamily(Bytes.toBytes(scanFamily));
>
>                 TableMapReduceUtil.initTableMapperJob(table,
>                                                       scan,
>                                                       mapper,
>                                                       Text.class,
>                                                       LongWritable.class,
>                                                       job);
>                 job.setCombinerClass(LongSumReducer.class);
>
>                 job.setReducerClass(reducer);
>
>     I've set up counters in the mapper to verify what is happening, so
>     that I know for certain that the mapper is being called twice with
>     the same bit of data.  I've also confirmed (using the hbase shell)
>     that each entry appears only once in the table.
>
>     Is there a known bug along these lines?  If not, does anyone have
>     any thoughts on what might be causing this or where I'd start
>     looking to diagnose?
>
>     Thanks
>     - Adam
>
>
>
>
> --
> Kind regards,
>
> Niels Basjes


Re: Duplicated entries with map job reading from HBase

Posted by Niels Basjes <ni...@basj.es>.
Hi,

I don't know the answer (there's simply not enough information in your
email), but I'm willing to make a guess:
Are you running on a system with two processing nodes?
If so, then try removing the Combiner. The combiner is a performance
optimization, and the whole processing should work without it.
Sometimes there is a design fault in the processing, and the combiner
disrupts it.
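
As an aside: with a pure sum, a combiner like LongSumReducer should be
safe, since addition is associative and commutative, but making it easy
to toggle keeps it out of the way while testing.  A sketch, where the
property name is made up for this example:

    // Sketch: install the combiner only when not debugging, so it can
    // be ruled in or out with a single config flag.
    if (!job.getConfiguration().getBoolean("debug.disable.combiner", false)) {
      job.setCombinerClass(LongSumReducer.class);
    }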

HTH

Niels Basjes

2010/11/5 Adam Phelps <am...@opendns.com>

> I've noticed an odd behavior with a map-reduce job I've written which is
> reading data out of an HBase table.  After a couple days of poking at this I
> haven't been able to figure out the cause of the problem, so I figured I'd
> ask on here.
>
> (For reference I'm running with the cdh3b2 release)
>
> The problem is that it seems that every line from the HBase table is passed
> to the mappers twice, thus resulting in counts ending up as exactly double
> what they should be.
>
> I set up the job like this:
>
>            Scan scan = new Scan();
>            scan.addFamily(Bytes.toBytes(scanFamily));
>
>            TableMapReduceUtil.initTableMapperJob(table,
>                                                  scan,
>                                                  mapper,
>                                                  Text.class,
>                                                  LongWritable.class,
>                                                  job);
>            job.setCombinerClass(LongSumReducer.class);
>
>            job.setReducerClass(reducer);
>
> I've set up counters in the mapper to verify what is happening, so that I
> know for certain that the mapper is being called twice with the same bit of
> data.  I've also confirmed (using the hbase shell) that each entry appears
> only once in the table.
>
> Is there a known bug along these lines?  If not, does anyone have any
> thoughts on what might be causing this or where I'd start looking to
> diagnose?
>
> Thanks
> - Adam
>



-- 
Kind regards,

Niels Basjes