Posted to user@pig.apache.org by Ian Stevens <i....@syncapse.com> on 2011/01/05 22:14:12 UTC

Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Hi everyone. In considering Pig for our HBase querying needs, I've run into a discrepancy between the size of Pig's result set and the size of the table being queried. I hope this is due to a misunderstanding of HBase and Pig on my part. The test case which generates the discrepancy is fairly simple, however.

The link below contains a Jython script which populates an HBase table with data in two column families. A corresponding Pig query retrieves data for one column and saves it to a CSV:

https://gist.github.com/766929

The Jython script has the following usage:

> jython hbase_test.py [table] [column count] [row count] [batch count]

This will populate a table named [table] with two column families. The first contains static data. The second contains the given number of columns, populated with data.
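
To give a rough sense of the query's shape (the exact script and query are in the gist; the column family and column names below are placeholders rather than the real ones), it is essentially a LOAD through HBaseStorage followed by a STORE to CSV:

  -- sketch only: load one column from the queried family and write it out as CSV
  raw = LOAD 'hbase://test'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:value_0')
        AS (value:chararray);
  STORE raw INTO 'test_output' USING PigStorage(',');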

The Pig query will return an inaccurate number of results for certain table sizes and configurations, most notably for tables exceeding 1.8 million rows with more than 2 columns in the queried column family, e.g.

> jython hbase_test.py test 3 1800000 100000

For instance, if I execute the above command and then run the corresponding Pig query, the result contains 905,914 rows. Note that if the table is re-populated and queried a second time, a different count is returned. If I run the query again without re-populating the table, I get the same count. The HBase shell, however, returns an accurate row count.

Some notes on reproducing this issue (or not):

* If the Jython script doesn't populate the meta column family, the issue goes away with the same query.
* If the Jython script populates 2 columns instead of 3, the issue goes away with the same query.
* The size of the column key or its value may influence whether the issue occurs. For instance, if I change the script to store 'value_%d' instead of 'value_%d_%d', retaining the random int, the issue goes away with the same query.

I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard using the stock Java that came with the OS. Attached is a log of the Pig console output. The error logs contain nothing of import.

Am I doing anything incorrectly? Is there a way I can work around this issue without compromising the column family being queried?

This appears to be a fairly simple case of Pig/HBase usage. Can anyone else reproduce the issue?

thanks,
Ian.


Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Ian, I looked through the code and I don't see how this could be happening.
Just to make sure this isn't an HBase issue -- can you run an equivalent
Java MR program to count the rows? The shell count is sequential and doesn't
use all the MapReduce machinery.

The job you want to run is org.apache.hadoop.hbase.mapreduce.RowCounter in
the hbase jar, I believe.
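
If memory serves, something along these lines will kick it off (adjust the
jar name and path to match your install; 'test' is the table from your
example):

  hadoop jar $HBASE_HOME/hbase-0.20.6.jar rowcounter test

You can then compare the row count it reports in the job counters against
what Pig gives you.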


Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Posted by Ian Stevens <i....@syncapse.com>.
The regionserver.out is empty. The regionserver.log contains only the following for the relevant time period:

Thu Jan  6 12:19:57 EST 2011 Starting regionserver on istevens.syncapse.local
ulimit -n 256
2011-01-06 12:19:59,588 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Not starting a distinct region server because hbase.cluster.distributed is false

Ian.

On 2011-01-06, at 1:32 PM, Dmitriy Ryaboy wrote:

> Do you happen to have the region server logs as well?
> The .out as well as .log
> 
> D
> 


Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Do you happen to have the region server logs as well?
The .out as well as .log

D


Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Posted by Ian Stevens <i....@syncapse.com>.
On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:

> That certainly sounds like a bug. I wonder if there is anything interesting
> in the HBase logs when you run the job that gets the wrong result?

Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log from about the time of the failed query. I restarted HBase before making the query, so there might be noise in the log associated with a restart.

master.log: http://pastebin.com/VwiXZ9BB
zookeeper.log: http://pastebin.com/CnFVyFT2

I believe the logging level is set to DEBUG for both logs.

Let me know if you need further logging.

thanks,
Ian.


Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That certainly sounds like a bug. I wonder if there is anything interesting
in the HBase logs when you run the job that gets the wrong result?
