Posted to dev@crunch.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/10/19 09:46:05 UTC
[jira] [Created] (CRUNCH-575) DistributedPipeline temp dir choice can collide with itself
Sean Owen created CRUNCH-575:
--------------------------------
Summary: DistributedPipeline temp dir choice can collide with itself
Key: CRUNCH-575
URL: https://issues.apache.org/jira/browse/CRUNCH-575
Project: Crunch
Issue Type: Bug
Components: Core
Affects Versions: 0.12.0
Reporter: Sean Owen
Assignee: Josh Wills
Priority: Minor
We've observed that Crunch jobs can fail because the output temp dir already exists:
{code}
2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
{code}
One possible cause is the choice of random directory name, which is based on a random nonnegative 32-bit int, so only 2^31 distinct names are possible. By the birthday bound, the chance of a collision exceeds 50% at about 55,000 temp dirs, which is not unimaginable.
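The 55,000 figure can be checked with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible names and n the number of temp dirs. The sketch below is only illustrative: the class and method names are made up for this example, and it assumes names are drawn uniformly and independently.

```java
public final class BirthdayBound {
    private BirthdayBound() {}

    /**
     * Approximate probability that at least two of n uniform, independent
     * draws from a space of the given size coincide (birthday approximation).
     */
    public static double collisionProbability(double n, double space) {
        // 1 - exp(-x), computed via expm1 for accuracy when x is tiny
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    /** Approximate number of draws at which the collision chance reaches 50%. */
    public static double drawsForEvenOdds(double space) {
        return Math.sqrt(2.0 * Math.log(2.0) * space);
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31); // nonnegative 32-bit int
        double space63 = Math.pow(2, 63); // nonnegative 64-bit long (assumed fix)

        System.out.printf("31 bits: 50%% collision chance near %,.0f temp dirs%n",
                drawsForEvenOdds(space31));
        System.out.printf("63 bits: 50%% collision chance near %,.0f temp dirs%n",
                drawsForEvenOdds(space63));
        System.out.printf("31 bits, 55,000 temp dirs: p = %.3f%n",
                collisionProbability(55_000, space31));
    }
}
```

Under these assumptions the 31-bit threshold lands in the mid-50,000s, and the 63-bit threshold is in the low billions.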
A suggested fix, at least for that theoretical cause, is to generate a much larger random value. 64 bits should put this firmly in the realm of the extremely improbable: a 50% chance of collision would then take billions of temp dirs, not tens of thousands.
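A minimal sketch of what such a fix might look like, assuming temp dir names keep the "crunch-<number>" form seen in the log above. The class and method names are hypothetical and are not taken from the Crunch code base. SecureRandom is used here because java.util.Random has only a 48-bit seed and so cannot produce every possible long value.

```java
import java.security.SecureRandom;

public final class TempDirNames {
    private static final SecureRandom RANDOM = new SecureRandom();

    private TempDirNames() {}

    /**
     * Hypothetical helper: returns a name of the form "crunch-<number>",
     * where the number is a nonnegative long with 63 random bits.
     */
    public static String randomTempDirName() {
        // Clear the sign bit so the numeric suffix is never negative
        long id = RANDOM.nextLong() & Long.MAX_VALUE;
        return "crunch-" + id;
    }

    public static void main(String[] args) {
        // "/tmp" is used only because it appears in the log above
        System.out.println("/tmp/" + randomTempDirName());
    }
}
```

Masking with Long.MAX_VALUE rather than calling Math.abs avoids the Long.MIN_VALUE edge case, where Math.abs returns a negative value.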
(HT [~wilfreds] / CC [~tomwhite])