Posted to user@hbase.apache.org by Dru Jensen <dr...@gmail.com> on 2008/09/05 21:03:16 UTC

missing rows in MR process

hbase-users,

I have two MR processes that run one right after the other in a  
script.  The first reads from a file and populates a table.  The  
second uses a TableMap over that table that was just populated.

The first MR process inserted 1950 rows successfully and everything  
looked correct.  For some reason the second MR process only got 76  
rows as input.  I ran the exact same MR process and the second time it  
got all 1950 rows.

Is there some time delay between the MR batch update of the first
process and the scan of the second?  How can I make sure this commit
is complete before launching the second MR process?

This is using the Release Candidate 0.2.1 running on Hadoop 0.17.2.1.

thanks,
Dru
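
One way to catch a short scan like the 76-of-1950 case before the second job consumes it is to compare the number of rows a scan returns against the number the first job wrote. The sketch below is only an illustration: RowCountGuard, waitForRows and the stand-in counter are hypothetical names, and no HBase or Hadoop API is called. In a real script the IntSupplier would wrap a scan of the table. A check like this detects the symptom; it does not address whatever causes the rows to go missing.

```java
import java.util.function.IntSupplier;

// Hypothetical helper: not part of HBase or Hadoop.
final class RowCountGuard {

    /**
     * Polls countRows until it reports at least expectedRows or attempts run out.
     * Returns true only when the expected count was observed.
     */
    static boolean waitForRows(IntSupplier countRows, int expectedRows,
                               int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int seen = countRows.getAsInt();
            if (seen >= expectedRows) {
                return true;
            }
            System.err.println("attempt " + attempt + ": saw " + seen
                + " of " + expectedRows + " rows");
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt status and give up.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in counter: a short first scan, then a complete one.
        int[] counts = {76, 1950};
        int[] calls = {0};
        IntSupplier fakeScan = () -> counts[Math.min(calls[0]++, counts.length - 1)];

        boolean ok = waitForRows(fakeScan, 1950, 3, 10L);
        System.out.println(ok ? "safe to launch second job"
                              : "row count still short; investigate");
    }
}
```

If the count never reaches the expected value, the script can stop instead of feeding a partial table to the second job.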

Re: missing rows in MR process

Posted by Dru Jensen <dr...@gmail.com>.

Aaahh Yes. Thanks.

On Sep 8, 2008, at 10:59 AM, stack wrote:

> Dru Jensen wrote:
>> Hi StAck,
>>
>> No, I don't think I'm hitting this.  The first MR process is using
>> in: SequenceFileInputFormat, out: TableReduce.  The second is using
>> in: TableMap, out: TableReduce.  I don't think the out-of-the-box
>> TableMap is using a filter, correct?
>>
> It looks like it is.
>
> The TableMap job makes a task per region by default.  Each task then  
> runs a scanner whose compass is defined by the region start/end  
> row.  When you get a scanner specifying a start/end row, it  
> eventually does the following:
>
> public Scanner getScanner(final byte [][] columns,
>   final byte [] startRow, final byte [] stopRow, final long timestamp)
> throws IOException {
>   return getScanner(columns, startRow, timestamp,
>     new WhileMatchRowFilter(new StopRowFilter(stopRow)));
> }
>
> ... i.e. put in place a StopRowFilter.
>
> So, maybe you are tripping over 856.
>
> St.Ack

Re: missing rows in MR process

Posted by stack <st...@duboce.net>.

Dru Jensen wrote:
> Hi StAck,
>
> No, I don't think I'm hitting this.  The first MR process is using in:
> SequenceFileInputFormat, out: TableReduce.  The second is using in:
> TableMap, out: TableReduce.  I don't think the out-of-the-box TableMap
> is using a filter, correct?
>
It looks like it is.

The TableMap job makes a task per region by default.  Each task then 
runs a scanner whose compass is defined by the region start/end row.  
When you get a scanner specifying a start/end row, it eventually does 
the following:

  public Scanner getScanner(final byte [][] columns,
    final byte [] startRow, final byte [] stopRow, final long timestamp)
  throws IOException {
    return getScanner(columns, startRow, timestamp,
      new WhileMatchRowFilter(new StopRowFilter(stopRow)));
  }

... i.e. put in place a StopRowFilter.

So, maybe you are tripping over 856.

St.Ack
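
The per-region scan described above can be modelled with plain Java collections. This is a simplified sketch, not HBase code: RegionScanSketch, scanRegion and countAcrossRegions are hypothetical names, a TreeMap stands in for the table, and the stop row is treated as exclusive, which is an assumption of the sketch. It shows the intended behaviour, where each task scans from its region's start row until it reaches the stop row, so that the per-region counts add up to the full table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Simplified model only: plain collections stand in for HBase regions and scanners.
final class RegionScanSketch {

    /** Rows in [startRow, stopRow); an empty stopRow means "scan to the end of the table". */
    static List<String> scanRegion(TreeMap<String, String> table,
                                   String startRow, String stopRow) {
        List<String> seen = new ArrayList<>();
        for (String row : table.tailMap(startRow, true).keySet()) {
            // Models a stop-row check wrapped in a while-match check:
            // the scan ends at the first row at or past stopRow.
            if (!stopRow.isEmpty() && row.compareTo(stopRow) >= 0) {
                break;
            }
            seen.add(row);
        }
        return seen;
    }

    /** Runs one scan per region; regionStartRows holds the sorted region start rows. */
    static int countAcrossRegions(TreeMap<String, String> table,
                                  List<String> regionStartRows) {
        int total = 0;
        for (int i = 0; i < regionStartRows.size(); i++) {
            String start = regionStartRows.get(i);
            // The next region's start row is this region's stop row.
            String stop = (i + 1 < regionStartRows.size()) ? regionStartRows.get(i + 1) : "";
            total += scanRegion(table, start, stop).size();
        }
        return total;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1950; i++) {
            table.put(String.format("row%04d", i), "value");
        }
        // Hypothetical region boundaries, chosen only for illustration.
        List<String> regionStartRows = Arrays.asList("", "row0500", "row1200");
        int scanned = countAcrossRegions(table, regionStartRows);
        System.out.println("rows written=" + table.size() + ", rows scanned=" + scanned);
    }
}
```

When the stop-row logic behaves as modelled here, the scanned total matches the number of rows written. A filter that ends a scan early would show up as a smaller total, as in the 76-row run.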



Re: missing rows in MR process

Posted by Dru Jensen <dr...@gmail.com>.

Hi StAck,

No, I don't think I'm hitting this.  The first MR process is using in:
SequenceFileInputFormat, out: TableReduce.  The second is using in:
TableMap, out: TableReduce.  I don't think the out-of-the-box TableMap
is using a filter, correct?

Dru

On Sep 5, 2008, at 3:59 PM, stack wrote:

> This is odd Dru.  Do you think you are seeing
> https://issues.apache.org/jira/browse/HBASE-856?  Are you using filters?
> St.Ack

Re: missing rows in MR process

Posted by stack <st...@duboce.net>.

This is odd Dru.  Do you think you are seeing 
https://issues.apache.org/jira/browse/HBASE-856?  Are you using filters?
St.Ack

