Posted to user@pig.apache.org by kiran chitturi <ch...@gmail.com> on 2013/03/13 15:48:40 UTC

Usage of 'limit' with Pig for Hbase

Hi!

I am using Pig 0.10.0 with HBase in distributed mode to read records,
and I used the command below.

fields = load 'hbase://documents'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'field:fields_j', '-loadKey true -limit 5')
    as (rowkey, fields:map[]);

I want Pig to limit the output to only 5 records, but the result is quite
different. Please see the logs below.

Input(s):
Successfully read 250 records (16520 bytes) from: "hbase://documents"

Output(s):
Successfully stored 250 records (19051 bytes) in:
"hdfs://LucidN1:50001/tmp/temp1510040776/tmp1443083789"

Counters:
Total records written : 250
Total bytes written : 19051
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201303121846_0056

2013-03-13 14:43:10,186 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 250 time(s).
2013-03-13 14:43:10,186 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-03-13 14:43:10,210 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 51
2013-03-13 14:43:10,211 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 51


Am I using the '-limit' option the wrong way?

Please let me know your suggestions.

Thanks,
-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: Usage of 'limit' with Pig for Hbase

Posted by kiran chitturi <ch...@gmail.com>.
Is this a better way to limit than using Pig's LIMIT operator (fields =
LIMIT fields 5;), since the filtering is already done while loading?

Thanks,


On Thu, Mar 14, 2013 at 9:50 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> To explain what's going on:
> -limit for HBaseStorage limits the number of rows returned from *each
> region* in the hbase table. It's an optimization -- there is no way for the
> LIMIT operator to be pushed down to the loader, so you can do it explicitly
> if you know you only need a few rows and don't want to pull the rest from
> HBase just to drop them on the floor once they've been extracted and sent
> to your mappers.



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: Usage of 'limit' with Pig for Hbase

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
To explain what's going on:
-limit for HBaseStorage limits the number of rows returned from *each
region* of the HBase table. It's an optimization -- there is no way for
the LIMIT operator to be pushed down to the loader, so you can apply the
limit explicitly at load time if you know you only need a few rows and
don't want to pull the rest from HBase just to drop them on the floor
once they've been extracted and sent to your mappers.
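
For concreteness, a minimal sketch of the combined approach, reusing the
table and column names from the thread (the fields_top alias and the dump
are illustrative). The log numbers fit this per-region behavior: with
-limit 5 and the 51 input splits reported, each region returns up to 5
rows, so reading 250 records overall is about what you'd expect.

-- -limit caps the rows fetched from EACH region, keeping the scan cheap;
-- the LIMIT operator then enforces the exact overall cap of 5.
fields = load 'hbase://documents'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'field:fields_j', '-loadKey true -limit 5')
    as (rowkey, fields:map[]);
fields_top = limit fields 5;
dump fields_top;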


On Wed, Mar 13, 2013 at 9:17 AM, kiran chitturi <ch...@gmail.com> wrote:

> Thank you. This cleared my doubt.

Re: Usage of 'limit' with Pig for Hbase

Posted by kiran chitturi <ch...@gmail.com>.
Thank you. This cleared my doubt.


On Wed, Mar 13, 2013 at 11:37 AM, Bill Graham <bi...@gmail.com> wrote:

> The -limit passed to HBaseStorage is the limit per mapper reading from
> HBase. If you want to limit overall records, also use LIMIT:
>
> fields = LIMIT fields 5;
>



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: Usage of 'limit' with Pig for Hbase

Posted by Bill Graham <bi...@gmail.com>.
The -limit passed to HBaseStorage is the limit per mapper reading from
HBase. If you want to limit the overall number of records, also use the
LIMIT operator:

fields = LIMIT fields 5;
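
Putting the two pieces together, a sketch of the full script under the
thread's setup (the store path is hypothetical):

fields = load 'hbase://documents'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'field:fields_j', '-loadKey true -limit 5')
    as (rowkey, fields:map[]);
fields = limit fields 5;  -- reassigning the alias is fine in Pig
store fields into '/tmp/documents_sample';  -- hypothetical output path

With the global LIMIT in place, the job should report 5 records written
rather than 250.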


On Wed, Mar 13, 2013 at 7:48 AM, kiran chitturi <ch...@gmail.com> wrote:

> Hi!
>
> I am using Pig 0.10.0 with Hbase in distributed mode to read the records
> and I have used this command below.
>
> fields = load 'hbase://documents' using
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('field:fields_j','-loadKey
> true  -limit 5') as (rowkey, fields:map[]);
>
> I want pig to limit the records to only 5 but it is quite different. Please
> see the logs below.
>
> Input(s):
> Successfully read 250 records (16520 bytes) from: "hbase://documents"
>
> Output(s):
> Successfully stored 250 records (19051 bytes) in:
> "hdfs://LucidN1:50001/tmp/temp1510040776/tmp1443083789"
>
> Counters:
> > Total records written : 250
> > Total bytes written : 19051
> > Spillable Memory Manager spill count : 0
> > Total bags proactively spilled: 0
> > Total records proactively spilled: 0
> > Job DAG:
> > job_201303121846_0056
> >
> > 2013-03-13 14:43:10,186 [main] WARN
> >
>  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 250 time(s).
> > 2013-03-13 14:43:10,186 [main] INFO
> >
>  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Success!
> > 2013-03-13 14:43:10,210 [main] INFO
> >  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths
> > to process : 51
> > 2013-03-13 14:43:10,211 [main] INFO
> >  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> > input paths to process : 51
>
>
> Am I using the 'limit' keyword the wrong way ?
>
> Please let me know your suggestions.
>
> Thanks,
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*