Posted to dev@chukwa.apache.org by Ariel Rabkin <as...@gmail.com> on 2011/01/04 04:37:57 UTC

questions about pig

Got a couple questions about the pig-based aggregation. These may
slightly duplicate JIRA comments, so apologies and no need to answer
more than once.

1) Can we run the aggregation scripts in local mode?  I haven't been
able to get Pig to read from anything other than file:/// in local
mode. Is there a trick to it?

2) Is there a good way to sanity check my tables and make sure the
data in HBase looks right? Not quite sure what they "should" look
like.

3) What's the default epoch to start aggregating from?  What happens
if I don't specify START=  to the command?

4) Is there a good way to find out what the last epoch it started
summarizing from was?  Is there a big cost to being over-inclusive?


--Ari

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

Re: questions about pig

Posted by Eric Yang <ey...@yahoo-inc.com>.
I am also getting the same problem on my cluster.  I will have a patch for the empty row key problem soon.

Regards,
Eric

On 1/4/11 8:42 PM, "Eric Yang" <er...@gmail.com> wrote:

It looks like HBaseStorage intentionally treats an empty row key as
invalid.  Nothing can be done on the script side to skip this.  You
should be able to fetch the empty row with:

get 'SystemMetrics', ''

in the hbase shell.  If that returns a result, then you need to delete the row:

deleteall 'SystemMetrics', ''

The question is, how do you end up with an empty row key?  I can't
figure out how this is possible if the metrics are streamed using the
SystemMetrics adaptor.  Any idea?
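Conceptually, the check that rejects the row lives in HBase's Put
constructor; here is a hypothetical Python sketch of that check and of a
writer-side guard that drops such records (names are mine, not the
actual HBase or Chukwa code):

```python
def validate_row_key(key: bytes) -> bytes:
    # Conceptually mirrors the check in org.apache.hadoop.hbase.client.Put:
    # an empty (or missing) row key is rejected outright.
    if not key:
        raise ValueError("Row key is invalid")
    return key

# A guard upstream of the store call would drop bad records instead of failing:
records = [b"1294116960000-host1", b"", b"1294116960000-host2"]
clean = [r for r in records if r]  # keep only records with a non-empty key
```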

Regards,
Eric

On Tue, Jan 4, 2011 at 6:05 PM, Ariel Rabkin <as...@gmail.com> wrote:
> Hm.
>
> The table is biggish; awkward to scan by hand. Can we modify the script to
> ignore empty rows?
>
> --Ari
>
> On Tue, Jan 4, 2011 at 8:35 PM, Eric Yang <ey...@yahoo-inc.com> wrote:
>> This looks like the row key is empty after parsing.  What does the row key look like in SystemMetrics table?
>> The expected format is:
>>
>> 1234567890000-hostname
>>
>> Make sure there is no empty row key in SystemMetrics table.
>>
>> Regards,
>> Eric
>>
>> On 1/4/11 5:09 PM, "Ariel Rabkin" <as...@gmail.com> wrote:
>>
>> So I have pig+hbase running. Thanks so much!
>>
>> But now I get the following error, from the System Metrics aggregation:
>>
>> java.io.IOException: java.lang.IllegalArgumentException: Row key is invalid
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:438)
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:251)
>>        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
>> Caused by: java.lang.IllegalArgumentException: Row key is invalid
>>        at org.apache.hadoop.hbase.client.Put.<init>(Put.java:79)
>>        at org.apache.hadoop.hbase.client.Put.<init>(Put.java:69)
>>        at org.apache.pig.backend.hadoop.hbase.HBaseStorage.putNext(HBaseStorage.java:355)
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:138)
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:97)
>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:508)
>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:436)
>>        ... 7 more
>>
>>
>> Thoughts?
>>
>>
>>
>> --
>> Ari Rabkin asrabkin@gmail.com
>> UC Berkeley Computer Science Department
>>
>>
>
>
>
> --
> Ari Rabkin asrabkin@gmail.com
> UC Berkeley Computer Science Department
>


Re: questions about pig

Posted by Eric Yang <er...@gmail.com>.
Hi Ari,

1) It should work in both MapReduce mode and local mode.  Make sure
PIG_CLASSPATH contains the Hadoop and HBase config directories.  In
addition, make sure you are passing
-Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar

2) Use the hbase shell and run:

scan "ClusterSummary";

There should be some data in every column family, in text-readable form.
It should be easy to write a unit test program to verify the data.
A sample from my cluster:

 1294116960000-chukwa        column=cpu:User, timestamp=1294117043968, value=0.041503639495471464
 1294116960000-chukwa        column=disk:ReadBytes, timestamp=1294117043968, value=28672.0
 1294116960000-chukwa        column=disk:Reads, timestamp=1294117043968, value=3.0
 1294116960000-chukwa        column=disk:WriteBytes, timestamp=1294117043968, value=2564096.0
 1294116960000-chukwa        column=disk:Writes, timestamp=1294117043968, value=213.0
 1294116960000-chukwa        column=hdfs:BlockCapacity, timestamp=1294117041309, value=32

The first column is the row key, which is composed of [timestamp]-[clustername].
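The composite key is easy to take apart; a quick illustrative sketch
(Python, not part of Chukwa itself):

```python
def parse_row_key(row_key):
    """Split a row key of the form [timestamp]-[clustername].

    The timestamp is epoch milliseconds; the cluster name may itself
    contain dashes, so split only on the first one.
    """
    ts, _, cluster = row_key.partition("-")
    return int(ts), cluster

ts, cluster = parse_row_key("1294116960000-chukwa")
```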

3) Aggregation starts from whatever epoch you specify.  If you don't
pass START= to the command, the script defaults to 1234567890000, which
translates to:

Fri, 13 Feb 2009 23:31:30 GMT

If you only want to run the aggregation from the current day, run the
script with START set to the current epoch time in milliseconds.
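You can sanity-check that default with a couple of lines (a Python
sketch, just for illustration):

```python
from datetime import datetime, timezone

def epoch_ms_to_gmt(ms):
    """Render an epoch-milliseconds timestamp as a GMT date string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime(
        "%a, %d %b %Y %H:%M:%S GMT")

print(epoch_ms_to_gmt(1234567890000))  # Fri, 13 Feb 2009 23:31:30 GMT
```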

4) You could write a Java program that uses a bloom filter to scan HBase
for the last row, but that is a very inefficient way to go about it.
It's best to launch the aggregation script at a fixed interval (crontab
or Oozie) to continuously process the data, and to use 2x the interval
for the scanning range so that late-arriving data is covered.  For
example, if you run the aggregation script every 5 minutes, use
START=current time - 10 minutes.
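The 2x rule from that example works out like this (illustrative Python;
the real scripts just take START on the command line):

```python
import time

def aggregation_start_ms(interval_minutes, now_ms=None):
    """Compute a START value covering twice the run interval, so
    late-arriving records still fall inside the scan range."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - 2 * interval_minutes * 60 * 1000

# A 5-minute cron job scans the last 10 minutes:
start = aggregation_start_ms(5, now_ms=1294117043968)
```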

Regards,
Eric
