Posted to user@crunch.apache.org by John Jensen <je...@richrelevance.com> on 2013/05/23 01:01:53 UTC

Problem running job with large number of directories

I have a curious problem when running a Crunch job on (Avro) files in a fairly large set of directories (slightly less than 100).
After some fraction of the mappers have run, they start failing with the exception below. Things work fine with a smaller number of directories.

The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI' string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has something to do with deserializing that value, but reading through the code I don't see any obvious way that could happen.
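The "magic" string does appear to be a fragment of that base64-encoded value rather than a real class name: dropping its first character realigns it to a 4-character base64 boundary, after which it decodes cleanly into a piece of Avro schema JSON. A quick stdlib-only Java check (the class name here is just for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DecodeFragment {
    public static void main(String[] args) {
        // The bogus "class name" reported by the exception.
        String fragment =
            "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI";

        // The fragment starts mid-quantum; dropping one leading character
        // realigns the rest to a 4-character base64 boundary, which the
        // basic decoder accepts even without padding.
        byte[] decoded = Base64.getDecoder().decode(fragment.substring(1));
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
        // prints: tring"},{"name":"value","type":"string"}]}},"default"
    }
}
```

So whatever is handing this to the class loader is reading a slice out of the middle of a serialized Avro schema.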

Furthermore, the crunch.inputs.dir config entry is just under 1.5 MB, so it would not surprise me if I'm running up against a Hadoop limit somewhere.
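If that ~1.5 MB value is being silently truncated somewhere, the symptom fits: a reader walking the blob by expected offsets would drift and hand an arbitrary base64 fragment to the class loader. A toy, stdlib-only Java sketch of that failure mode (the cap and schema below are invented for illustration and do not reflect actual Hadoop or Crunch internals):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TruncationSketch {
    // Invented cap, tiny so the effect is visible; a stand-in for whatever
    // real limit a ~1.5 MB config value might be hitting.
    static final int CAP = 48;

    public static void main(String[] args) {
        // Stand-in for the Avro schema JSON that ends up base64-encoded
        // inside the crunch.inputs.dir config entry.
        String schema =
            "{\"fields\":[{\"name\":\"key\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"string\"}]}";
        String value = Base64.getEncoder()
                             .encodeToString(schema.getBytes(StandardCharsets.UTF_8));

        // A writer that silently cuts the value at the cap loses the tail...
        String stored = value.substring(0, Math.min(CAP, value.length()));

        // ...so a reader expecting the full blob drifts off its offsets and
        // can end up treating the leftover fragment as the next field,
        // e.g. the split class name in the stack trace below.
        String misreadAsClassName = value.substring(stored.length());
        System.out.println("Split class " + misreadAsClassName + " not found");
    }
}
```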

Has anybody else seen similar issues? (This is Crunch 0.5.0, btw.)

-- John


java.io.IOException: Split class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.ClassNotFoundException: Class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
        ... 7 more

Re: Problem running job with large number of directories

Posted by Josh Wills <jw...@cloudera.com>.
Hey John,

I posted a patch here: https://issues.apache.org/jira/browse/CRUNCH-209

I created it against master, as I don't think there have been any changes
to the MR execution stuff in 0.6.0 that we need to worry about, but if you can't
apply it, let me know and I'll find a way to backport it. I'm 50-50 on
whether this will fix the issue, so please let me know if this doesn't do
the trick.

J





-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

RE: Problem running job with large number of directories

Posted by John Jensen <je...@richrelevance.com>.
Certainly. Appreciate it.


Re: Problem running job with large number of directories

Posted by Josh Wills <jw...@cloudera.com>.
Hey John,

I haven't hit that one before, but I have some hypotheses we could test if
you're up for trying out some patches I write.

J




-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>