
Posted to user@pig.apache.org by Eugene Morozov <em...@griddynamics.com> on 2013/04/09 09:58:28 UTC

HBaseStorage. Inconsistent result.

Hello everyone.

I have the following script:
pages = LOAD 'hbase://mmpages' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('t:d', '-loadKey');
pages2 = FOREACH pages GENERATE $0;
pages3 = DISTINCT pages2;
g_pages = GROUP pages3 all PARALLEL 1;
s_pages = FOREACH g_pages GENERATE 'count', COUNT(pages3);
DUMP s_pages;

It just counts the number of keys in the table.
The problem is that it gives me different results from run to run.
I did two launches:
    * the first one - 7 tasks in parallel (I launched the same script 7 times,
trying to imitate a heavy workload)
    * the second one - 9 tasks in parallel.

All 7 tasks in the first launch and 8 of the 9 in the second gave me the
correct result, which is:

Input(s):
Successfully read 246419854 records (102194 bytes) from: "hbase://mmpages"
...
(count,246419854)


But the last one of the second launch gave a different result:
Input(s):
Successfully read 246419853 records (102194 bytes) from: "hbase://mmpages"
...
(count,246419853)

The number of bytes read is the same, but the number of rows is different.

There was definitely no change in mmpages. We do not use standard
Put/Delete, only bulkImport, and no major compaction was run on
this table. Even if one had run, it wouldn't have deleted anything,
because its TTL is set to '2147483647'. Moreover, this table is for
debug purposes - nobody uses it but me.


The original issue I hit was actually the same, but with my own
HBaseStorage. It gives much less consistent results. For example, for 7
parallel runs it gives me:
--(count,246419854)
--(count,246419173) : Successfully read 246419173 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246419854) : Successfully read 246419854 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246419854) : Successfully read 246419854 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246419173) : Successfully read 246419173 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246418816) : Successfully read 246418816 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246418690)
-- and one job failed due to a lease exception.
During runs with my own HBaseStorage I see many map tasks killed with a
"lease does not exist" exception, though the job usually finishes successfully.

As you can see, the number of bytes read is exactly the same every time, but
the number of rows read differs. I got exactly the same behavior with the
native HBaseStorage, though the difference there is really small.
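
To put numbers on the discrepancy, the counts above can be compared against
the total that most runs report. The following is a small Python sketch; the
variable and function names are illustrative and not part of Pig or HBase,
and treating 246419854 as the true row count is an assumption based on it
being the most frequent result.

```python
# Counts copied from the job output quoted in this message.
# Assumption: the most frequent value is the true number of rows.
EXPECTED = 246419854

# Seven counts reported with the custom HBaseStorage, in the order listed.
custom_storage_runs = [
    246419854, 246419173, 246419854, 246419854,
    246419173, 246418816, 246418690,
]

# The single deviating count reported with the native HBaseStorage.
native_storage_odd_run = 246419853


def missing_rows(count, expected=EXPECTED):
    """Return how many rows a run is short of the expected total."""
    return expected - count


for count in custom_storage_runs:
    print("custom", count, "missing", missing_rows(count))
print("native", native_storage_odd_run, "missing",
      missing_rows(native_storage_odd_run))
```

The shortfalls come out as 681, 1038 and 1164 rows for the custom loader and
a single row for the native one, so every deviating run is short of the
expected total and none of them reports extra rows.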

But anyway, I didn't expect that the original HBaseStorage could show the
same behavior. So now my question is more about org.apache...HBaseStorage than
about my own HBaseStorage.

Any advice on how
    to prove anything regarding the native org.apache...HBaseStorage in order
    to fix it
or
    to do more experiments on the matter
would be really appreciated.
-- 
Eugene Morozov
Developer of Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com

Re: HBaseStorage. Inconsistent result.

Posted by Jean-Daniel Cryans <jd...@apache.org>.

Can you run a RowCounter a bunch of times to see if it exhibits the same
issue? It would tell us if it's HBase or Pig that causes the issue.

http://hbase.apache.org/book.html#rowcounter
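
One way to follow this suggestion is to launch the RowCounter job several
times and compare the reported counts. The Python sketch below wraps the
`hbase org.apache.hadoop.hbase.mapreduce.RowCounter <table>` command described
at the link above. The helper names are hypothetical, and it assumes that the
`hbase` launcher is on the PATH and that the job output contains a `ROWS=<n>`
counter line, which may differ between versions.

```python
import re
import subprocess


def rowcounter_command(table):
    """Build the command line for HBase's bundled RowCounter MapReduce job."""
    return ["hbase", "org.apache.hadoop.hbase.mapreduce.RowCounter", table]


def parse_rows(output):
    """Extract the ROWS counter from job output, or None if it is absent.

    Assumption: the counter is printed as 'ROWS=<number>'.
    """
    match = re.search(r"ROWS=(\d+)", output)
    return int(match.group(1)) if match else None


def count_rows(table):
    """Run RowCounter once and return the parsed row count."""
    result = subprocess.run(
        rowcounter_command(table), capture_output=True, text=True
    )
    # Hadoop job counters are usually logged to stderr, so search both streams.
    return parse_rows(result.stdout + result.stderr)


def repeated_counts(table, runs=5):
    """Run RowCounter sequentially `runs` times and collect the counts."""
    return [count_rows(table) for _ in range(runs)]
```

Calling `repeated_counts("mmpages")` on a cluster node would give a list whose
values should all be equal if HBase itself returns a stable row count. The
runs here are sequential; since the Pig discrepancy showed up under
concurrent load, starting several RowCounter jobs at once may be the closer
comparison.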

J-D

