Posted to user@pig.apache.org by Joe Crobak <jo...@gmail.com> on 2012/04/05 17:44:21 UTC

strange gzip-related error

Hi,

I'm using Pig 0.9.2 on CDH3u3 with a snapshot build of Elephant Bird in
order to get JSON parsing. I see an incredibly unusual error with certain
gzip-compressed files. It's probably easiest to show you a Pig session:

grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
grunt> register '/home/joe/json-simple-1.1.jar';
grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING
TextLoader() as (line: chararray);
grunt> X = FOREACH apiHits GENERATE line,
com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
grunt> Y = LIMIT X 2;
grunt> dump Y;
(succeeds, and I get what I expect).

Now, if I try to do a projection using the json field, I get the following:

grunt> A = FILTER X BY
>>   json#'logtype' == 'foo'
>>   OR json#'consumer' == 'foo1'
>>   OR json#'consumer' == 'foo2'
>>   OR json#'consumer' == 'foo3'
>>   OR json#'consumer' == 'foo4'
>>   ;
grunt> B = LIMIT A 2;
grunt> dump B;

ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long
cannot be cast to org.json.simple.JSONObject

And in the task tracker logs, the stack trace suggests that the JSON UDF is
seeing compressed data [1]. Does anyone have ideas on how to debug this,
or guesses as to what the problem is? Can I somehow determine whether
Hadoop is actually decompressing the data?

Thanks!
Joe

[1]

2012-04-05 14:39:20,211 WARN
com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not
json-decode string: ����
Unexpected character () at position 0.
	at org.json.simple.parser.Yylex.yylex(Unknown Source)
	at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
	at org.json.simple.parser.JSONParser.parse(Unknown Source)
	at org.json.simple.parser.JSONParser.parse(Unknown Source)
	at org.json.simple.parser.JSONParser.parse(Unknown Source)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)
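As a first debugging step, you can check whether a part file is well-formed gzip and yields text when decompressed. This is a sketch with a made-up local path; on a cluster you would stream the HDFS file through `hadoop fs -cat` instead of creating one locally:

```shell
# Create a stand-in for the part file (hypothetical local path).
printf '{"logtype":"foo","consumer":"foo1"}\n' | gzip > /tmp/part-r-00000.gz

# Verify gzip integrity, then inspect the first decompressed line.
gzip -t /tmp/part-r-00000.gz && echo "gzip integrity OK"
gunzip -c /tmp/part-r-00000.gz | head -n 1
```

If `gzip -t` passes but the UDF still sees binary garbage, the corruption is in the uncompressed contents rather than in Hadoop's decompression.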

Re: strange gzip-related error

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Joe, we'd be happy to take a pull request that addresses this cast
exception and maybe increments a counter.
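A minimal sketch of such a guard follows. The `fakeParse` method here is a hypothetical stand-in, not json-simple's parser, but it mimics the relevant behavior: json-simple's `JSONParser.parse` is declared to return `Object` and can yield a `Long` for non-object input, which is where the unconditional cast blows up.

```java
import java.util.Map;

public class SafeCast {
    // Hypothetical stand-in for a parser whose parse() is declared to
    // return Object: numeric input comes back as a Long, not a map.
    static Object fakeParse(String s) {
        try {
            return Long.parseLong(s.trim());
        } catch (NumberFormatException e) {
            return Map.of("raw", s);
        }
    }

    // Guarded cast: return null (and, in a real UDF, bump a counter)
    // instead of letting a ClassCastException kill the task.
    static Map<?, ?> asMap(Object parsed) {
        return (parsed instanceof Map) ? (Map<?, ?>) parsed : null;
    }

    public static void main(String[] args) {
        System.out.println(asMap(fakeParse("12345")));               // null
        System.out.println(asMap(fakeParse("{\"a\": 1}")) != null);  // true
    }
}
```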


Re: strange gzip-related error

Posted by Joe Crobak <jo...@gmail.com>.
Hi Norbert,

In some cases, I actually get a ClassCastException, which I guess is the
eventual cause of the job failures:

java.lang.ClassCastException: java.lang.Long cannot be cast to
org.json.simple.JSONObject
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:52)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:42)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:22)

(Note that I switched back to the 2.1.11 tag, so the stack trace
corresponds to
https://github.com/kevinweil/elephant-bird/blob/b300849f6d014aaac520e385a34aa37adb53b5fa/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java)

I've put together a dummy heuristic to skip lines that don't match
^\\{.*\\}$ and this seems to get me past the CCE.
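In Java, that guard amounts to something like this (a hypothetical wrapper, not the actual UDF code):

```java
import java.util.regex.Pattern;

public class JsonLineGuard {
    // Same heuristic as above: only lines that start with '{' and end
    // with '}' are handed to the JSON parser; everything else is skipped.
    private static final Pattern JSON_OBJECT = Pattern.compile("^\\{.*\\}$");

    public static boolean looksLikeJsonObject(String line) {
        return line != null && JSON_OBJECT.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeJsonObject("{\"logtype\":\"foo\"}")); // true
        System.out.println(looksLikeJsonObject("\u0000\u0003 garbage"));  // false
    }
}
```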

Thanks for the info, though, I clearly missed the logging that you pointed
out.

Joe




Re: strange gzip-related error

Posted by Norbert Burger <no...@gmail.com>.
So in this case, it seems like JsonStringToMap is properly catching the
parse exception; in fact, it's the catch clause of the UDF that's
generating the "Could not json-decode string" message in your task tracker
logs.

Take a look at line 63 here:
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java

When a parse exception happens, the UDF returns null. Are you filtering
out nulls before trying to project?
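Concretely, a null guard in the session's terms would look something like this (a sketch reusing the relation and field names from the script above):

```pig
-- Drop rows where the UDF returned null (parse failure)
-- before dereferencing the json map.
A = FILTER X BY json IS NOT NULL
    AND (json#'logtype' == 'foo' OR json#'consumer' == 'foo1');
```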

Norbert


Re: strange gzip-related error

Posted by Joe Crobak <jo...@gmail.com>.
So it turns out our uncompressed data contains corrupted rows. Is there an
easy way to wrap the JsonStringToMap UDF to catch exceptions on unparsable
lines and just skip them?
