Posted to user@crunch.apache.org by Florian Laws <fl...@florianlaws.de> on 2013/06/25 16:30:10 UTC

Write to sequence file ignores destination path.

Hi all,

I'm trying to write a simple Crunch job that outputs a sequence file
consisting of a custom Writable.

The job runs successfully, but the output is not written to the path
that I specify in To.sequenceFile(),
but instead to a Crunch working directory.

This happens when running the job both locally and on my 1-node Hadoop
test cluster,
and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).

Code snippet:
---

public int run(String[] args) throws IOException {
  CommandLine cl = parseCommandLine(args);
  Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
  int docIdIndex = getColumnIndex(cl, "DocID");
  int ldaIndex = getColumnIndex(cl, "LDA");

  Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
  pipeline.setConfiguration(getConf());
  PCollection<String> lines = pipeline.readTextFile(
      (String) cl.getValue(INPUT_OPTION));
  PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
    new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
    tableOf(strings(), writables(NamedQuantizedVecWritable.class)));

  vectors.write(To.sequenceFile(output));

  PipelineResult res = pipeline.run();
  return res.succeeded() ? 0 : 1;
}
---

Log output from local run.
Note how the intended output path "/tmp/foo.seq" is reported in the
execution plan but is not actually used.
---

2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info
from SCDynamicStore
2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files
to new path: /tmp/foo.seq
2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User
classes may not be found. See JobConf(Class) or
JobConf#setJar(String).
2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating
MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122
with rwxr-xr-x
2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached
/tmp/crunch-1128974463/p1/MAP as
/tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached
/tmp/crunch-1128974463/p1/MAP as
/tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP

2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job
"com.issuu.mahout.utils.DbDumpToSeqFile:
Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"

2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status
available at: http://localhost:8080/
2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0
is done. And is in the process of commiting
2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0
is allowed to commit now

2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of
task 'attempt_local_0001_m_000000_0' to
/tmp/crunch-1128974463/p1/output

2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.

---


This crude patch makes the output end up at the right place,
but breaks a lot of other tests.
---

--- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
+++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
@@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
   protected void configureForMapReduce(Job job, Class keyClass, Class
valueClass,
       Class outputFormatClass, Path outputPath, String name) {
     try {
-      FileOutputFormat.setOutputPath(job, outputPath);
+      FileOutputFormat.setOutputPath(job, path);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }

---
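For anyone skimming the patch: my guess (and it is only a guess) as to why it breaks other tests is that Crunch normally commits job output into a per-pipeline working directory and moves the finished files to the user's target afterwards, so pointing FileOutputFormat straight at the final path short-circuits that move step. A standalone sketch of the generic work-dir-then-publish pattern, using plain java.nio and hypothetical names, not Crunch's actual classes:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class WorkDirCommit {
    // Stage output in a scratch working directory, then "publish" the
    // finished file by moving it to the user-specified target directory.
    static Path writeViaWorkDir(Path target, String contents) throws IOException {
        Path workDir = Files.createTempDirectory("work-");
        Path partFile = workDir.resolve("part-00000");
        Files.write(partFile, contents.getBytes());
        // The "commit" step: move the completed file to the target location.
        Files.createDirectories(target);
        Path published = target.resolve(partFile.getFileName());
        return Files.move(partFile, published, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempDirectory("target-");
        Path out = writeViaWorkDir(target, "hello");
        System.out.println(Files.readAllLines(out).get(0)); // prints "hello"
    }
}
```

If that picture is right, the patched line writes directly to the final path and skips the publish step, which would explain why tests that rely on the working directory fail.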


Am I doing something wrong, or is this a bug?

Best,

Florian

Re: Write to sequence file ignores destination path.

Posted by Florian Laws <fl...@florianlaws.de>.
I submitted CRUNCH-227 in JIRA for this now.

Best,

Florian


Re: Write to sequence file ignores destination path.

Posted by Florian Laws <fl...@florianlaws.de>.
I'll open a JIRA ticket when I'm back at a real computer. The Hadoop
version should be 1.0.3.

Best,

Florian

Re: Write to sequence file ignores destination path.

Posted by Josh Wills <jo...@gmail.com>.
Yeah, that's exactly how it works. I'll have some time to look at this more
closely later this morning. Would you mind opening a JIRA and/or letting me
know which Hadoop version you're running against?

Thanks!
Josh



Re: Write to sequence file ignores destination path.

Posted by Florian Laws <fl...@florianlaws.de>.
Hi Josh,

thanks for the quick reply.

With pipeline.done(), there is still no content at the intended output path,
and things get even more weird:

The log output states

2013-06-25 16:41:12 FileOutputCommitter:173 [INFO] Saved output of
task 'attempt_local_0001_m_000000_0' to
/tmp/crunch-1483549519/p1/output

but the directory
/tmp/crunch-1483549519

does not exist. It looks like this directory is created temporarily
during the run but removed again when the program finishes:
it is kept around with run() and removed with done().
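To make the kept-vs-removed behavior concrete, here is a tiny stand-in (hypothetical names, plain java.nio, not the Crunch API) where the "run" step leaves the scratch directory in place and the "done" step cleans it up, mirroring what I observe with run() and done():

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScratchDirDemo {
    // Stand-in for a pipeline run that stages output in a scratch directory.
    static Path runStyle() throws IOException {
        Path scratch = Files.createTempDirectory("crunch-demo-");
        Files.write(scratch.resolve("part-00000"), "data".getBytes());
        return scratch; // like run(): the scratch directory is left behind
    }

    // Stand-in for done(): delete the scratch directory once work finishes.
    static void doneStyle(Path scratch) throws IOException {
        Files.delete(scratch.resolve("part-00000"));
        Files.delete(scratch);
    }

    public static void main(String[] args) throws IOException {
        Path scratch = runStyle();
        System.out.println(Files.exists(scratch)); // prints "true"
        doneStyle(scratch);
        System.out.println(Files.exists(scratch)); // prints "false"
    }
}
```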

Best,

Florian




Re: Write to sequence file ignores destination path.

Posted by Josh Wills <jo...@gmail.com>.
Hey Florian,

At first glance, it seems like a bug to me. I'm curious if the result is
any different if you swap in pipeline.done() for pipeline.run()?

J

