Posted to user@crunch.apache.org by John Jensen <je...@richrelevance.com> on 2013/05/23 01:01:53 UTC
Problem running job with large number of directories
I have a curious problem when running a Crunch job on (Avro) files in a fairly large set of directories (just slightly fewer than 100).
After some fraction of the mappers have run, they start failing with the exception below. Things work fine with a smaller number of directories.
The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI' string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has something to do with deserializing that value, but reading through the code I don't see any obvious mechanism.
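As a sanity check on that theory: the bogus "class name" in the trace looks like a slice of base64. If we drop its first character so the remaining characters line up on 4-character base64 quanta (an assumption: the fragment starts mid-quantum of some larger encoded blob), it decodes cleanly to a chunk of Avro schema JSON:

```java
import java.util.Base64;

public class DecodeBogusClassName {
    public static void main(String[] args) {
        // The "class name" from the stack trace, with its leading 'z' dropped so
        // the remaining characters align on 4-character base64 quanta.
        String fragment =
            "dHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI";
        String decoded = new String(Base64.getDecoder().decode(fragment));
        // Prints: tring"},{"name":"value","type":"string"}]}},"default"
        System.out.println(decoded);
    }
}
```

That the "class name" decodes to a fragment of schema JSON suggests the split reader is landing in the middle of the serialized input metadata rather than at the field that actually holds the class name.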
Furthermore, the crunch.inputs.dir config entry is just under 1.5 MB, so it would not surprise me if I'm running up against a Hadoop limit somewhere.
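For what it's worth, hard size limits do lurk in this neighborhood. As one illustration (not a confirmed diagnosis of this bug), plain java.io.DataOutputStream.writeUTF stores the encoded length in an unsigned 16-bit field, so it rejects any string whose UTF-8 encoding exceeds 65535 bytes; a 1.5 MB value pushed through such a path would fail immediately:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

public class WriteUtfLimit {
    public static void main(String[] args) throws IOException {
        // writeUTF stores the encoded length in an unsigned 16-bit field,
        // so anything over 65535 bytes of UTF-8 is rejected outright.
        char[] big = new char[70_000];
        Arrays.fill(big, 'a');
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(new String(big));
            System.out.println("wrote " + big.length + " chars");
        } catch (UTFDataFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```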
Has anybody else seen similar issues? (this is 0.5.0, btw).
-- John
java.io.IOException: Split class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.ClassNotFoundException: Class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
... 7 more
Re: Problem running job with large number of directories
Posted by Josh Wills <jw...@cloudera.com>.
Hey John,
I posted a patch here: https://issues.apache.org/jira/browse/CRUNCH-209
I created it against master, as I don't think there have been any changes
to the MR execution code in 0.6.0 that we need to worry about, but if you
can't apply it, let me know and I'll find a way to backport it. I'm 50-50
on whether this will fix the issue, so please let me know if it doesn't do
the trick.
J
On Wed, May 22, 2013 at 4:42 PM, John Jensen <je...@richrelevance.com> wrote:
>
> Certainly. Appreciate it.
>
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
RE: Problem running job with large number of directories
Posted by John Jensen <je...@richrelevance.com>.
Certainly. Appreciate it.
________________________________
From: Josh Wills [jwills@cloudera.com]
Sent: Wednesday, May 22, 2013 4:38 PM
To: user@crunch.apache.org
Subject: Re: Problem running job with large number of directories
Hey John,
I haven't hit that one before, but I have some hypotheses we could test if you're up for trying out some patches I write.
J
Re: Problem running job with large number of directories
Posted by Josh Wills <jw...@cloudera.com>.
Hey John,
I haven't hit that one before, but I have some hypotheses we could test if
you're up for trying out some patches I write.
J
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>