Posted to dev@crunch.apache.org by "Micah Whitacre (JIRA)" <ji...@apache.org> on 2015/01/02 17:27:13 UTC
[jira] [Updated] (CRUNCH-227) Write to sequence file ignores destination path.

     [ https://issues.apache.org/jira/browse/CRUNCH-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Whitacre updated CRUNCH-227:
----------------------------------
    Fix Version/s: 0.8.4
                   0.12.0

> Write to sequence file ignores destination path.
> ------------------------------------------------
>
>                 Key: CRUNCH-227
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-227
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.6.0, 0.7.0
>         Environment: Hadoop 1.0.3
>            Reporter: Florian Laws
>            Assignee: Micah Whitacre
>             Fix For: 0.8.4, 0.12.0
>
>         Attachments: 0001-CRUNCH-227-Added-test-that-shows-ToolRunner-does-not.patch, CRUNCH-227.patch, CRUNCH-227_tests.patch
>
>
> I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable.
> The job runs successfully, but the output is written to a Crunch working directory instead of the path that I specify in To.sequenceFile().
> This happens when running the job both locally and on my 1-node Hadoop test cluster, and it happens with both Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).
> When using pipeline.done() instead of pipeline.run(), the Crunch working directory gets removed after execution, so in that case the output is not retained at all.
> Code snippet:
> ---
> {code}
> public int run(String[] args) throws IOException {
>   CommandLine cl = parseCommandLine(args);
>   Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
>   int docIdIndex = getColumnIndex(cl, "DocID");
>   int ldaIndex = getColumnIndex(cl, "LDA");
>   Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
>   pipeline.setConfiguration(getConf());
>   PCollection<String> lines = pipeline.readTextFile((String) cl.getValue(INPUT_OPTION));
>   PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
>     new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
>     tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
>   vectors.write(To.sequenceFile(output));
>   PipelineResult res = pipeline.run();
>   return res.succeeded() ? 0 : 1;
> }
> {code}
> ---
> Log output from a local run.
> Note that the intended output path "/tmp/foo.seq" is reported in the execution plan but is not actually used.
> ---
> {code}
> 2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info from SCDynamicStore
> 2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
> 2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files to new path: /tmp/foo.seq
> 2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122 with rwxr-xr-x
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
> 2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job "com.issuu.mahout.utils.DbDumpToSeqFile: Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
> 2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status available at: http://localhost:8080/
> 2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
> 2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
> 2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0 is allowed to commit now
> 2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/crunch-1128974463/p1/output
> 2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
> 2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.
> {code}
> ---
> This crude patch makes the output end up at the right place,
> but breaks a lot of other tests.
> ---
> {code}
> --- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
> +++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
> @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
>    protected void configureForMapReduce(Job job, Class keyClass, Class valueClass,
>        Class outputFormatClass, Path outputPath, String name) {
>      try {
> -      FileOutputFormat.setOutputPath(job, outputPath);
> +      FileOutputFormat.setOutputPath(job, path);
>      } catch (Exception e) {
>        throw new RuntimeException(e);
>      }
> {code}
> ---
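
For anyone reproducing this on a local run, the following is a minimal sketch of a check that the output landed where it was requested rather than in a leftover working directory. It uses only the JDK and assumes the paths are on the local filesystem, as in the log above. The class and method names are hypothetical, and the crunch-*/<stage>/output layout is inferred solely from the paths in that log, not from any documented Crunch contract.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Crunch or Hadoop.
final class OutputLocationCheck {

    /** True if target is an existing file, or a directory with at least one entry. */
    public static boolean hasOutput(Path target) throws IOException {
        if (!Files.exists(target)) {
            return false;
        }
        if (!Files.isDirectory(target)) {
            return true;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(target)) {
            return entries.iterator().hasNext();
        }
    }

    /**
     * Lists tmpRoot/crunch-STAR/STAGE/output directories, i.e. the layout seen in
     * the log above (for example /tmp/crunch-1128974463/p1/output).
     * The layout is an assumption based on that log only.
     */
    public static List<Path> leftoverWorkingOutputs(Path tmpRoot) throws IOException {
        List<Path> found = new ArrayList<>();
        if (!Files.isDirectory(tmpRoot)) {
            return found;
        }
        try (DirectoryStream<Path> workDirs = Files.newDirectoryStream(tmpRoot, "crunch-*")) {
            for (Path workDir : workDirs) {
                if (!Files.isDirectory(workDir)) {
                    continue;
                }
                try (DirectoryStream<Path> stages = Files.newDirectoryStream(workDir)) {
                    for (Path stage : stages) {
                        Path output = stage.resolve("output");
                        if (Files.isDirectory(output)) {
                            found.add(output);
                        }
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the paths from the report; override via arguments.
        Path target = Paths.get(args.length > 0 ? args[0] : "/tmp/foo.seq");
        Path tmpRoot = Paths.get(args.length > 1 ? args[1] : "/tmp");
        System.out.println("Output present at " + target + ": " + hasOutput(target));
        for (Path leftover : leftoverWorkingOutputs(tmpRoot)) {
            System.out.println("Working-directory output found: " + leftover);
        }
    }
}
```

Run it after pipeline.run(), not pipeline.done(): as described above, done() removes the working directory, so there would be nothing left to list.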



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)