Posted to user@crunch.apache.org by Miguel Paraz <mp...@gmail.com> on 2013/12/19 17:42:22 UTC

Copying to DistributedCache using -files

Hi,
I'm studying Crunch with code that relies on the DistributedCache to copy
files to the local filesystem. (My code is at
https://bitbucket.org/mparaz/maxmind-crunch)

I'm using 0.9.0-mapreduce2 on a 2.2.0 setup (Hortonworks Sandbox 2.0).

I see that Crunch programs use the same pattern as low-level MapReduce,
with ToolRunner.run() and implementing Tool.run().

Unfortunately, the file I specify with the "-files" parameter is not copied.
I logged getConf().get("tmpfiles") and that configuration entry is there.

At which point should the file be copied? I looked through the Hadoop source
code and found that tmpfiles is processed in
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java
- copyAndConfigureFiles()

Is this code not invoked when Crunch is used?
This works with the equivalent MapReduce 2.2.0 API code.

Is there a working example with distributed files that I could try?

Thanks!
Miguel

Re: Copying to DistributedCache using -files

Posted by Josh Wills <jw...@cloudera.com>.

No problem. I've never understood it either, just one of those things I
noticed a long time ago. :)


On Thu, Dec 19, 2013 at 8:26 PM, Miguel Paraz <mp...@gmail.com> wrote:

> Hi Josh,
>
> It's working now. Thanks for helping with my newbie question, and looking
> at the code.
>
> Confusing that omitting the new Configuration() works with the plain
> MapReduce API.
>
> Cheers,
> Miguel


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Copying to DistributedCache using -files

Posted by Miguel Paraz <mp...@gmail.com>.

Hi Josh,

It's working now. Thanks for helping with my newbie question, and looking
at the code.

Confusing that omitting the new Configuration() works with the plain
MapReduce API.

Cheers,
Miguel


On Fri, Dec 20, 2013 at 3:38 AM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Miguel,
>
> You need to call:
>
> ToolRunner.run(new Configuration(), new MaxmindCrunchJob(), args);
>
> in main() to pick up the args from the command line.
>
> J

Re: Copying to DistributedCache using -files

Posted by Josh Wills <jw...@cloudera.com>.

Hey Miguel,

You need to call:

ToolRunner.run(new Configuration(), new MaxmindCrunchJob(), args);

in main() to pick up the args from the command line.

J
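
To make the role of the generic options concrete, here is a toy model in
plain Java of the step that separates "-files" from the application
arguments and records it under the "tmpfiles" key mentioned in the question.
It is only a sketch and is not Hadoop's own parser: the class
GenericOptionsSketch, the ParsedArgs holder, and the file name lookup.dat
are hypothetical, and the real parser does more work (for example, it
checks the paths it is given).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Hadoop's GenericOptionsParser, just an
// illustration of splitting generic options from application arguments.
class GenericOptionsSketch {

    // Hypothetical stand-in for a Hadoop Configuration plus leftover args.
    static final class ParsedArgs {
        final Map<String, String> conf = new HashMap<>();
        final List<String> remaining = new ArrayList<>();
    }

    // Moves "-files <comma-separated list>" into conf under "tmpfiles"
    // and leaves every other argument for the application itself.
    static ParsedArgs parse(String[] args) {
        ParsedArgs parsed = new ParsedArgs();
        for (int i = 0; i < args.length; i++) {
            if ("-files".equals(args[i]) && i + 1 < args.length) {
                parsed.conf.put("tmpfiles", args[++i]);
            } else {
                parsed.remaining.add(args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Hypothetical command line: one cached file, then input and output paths.
        ParsedArgs parsed = parse(new String[] {"-files", "lookup.dat", "in", "out"});
        System.out.println("tmpfiles = " + parsed.conf.get("tmpfiles"));
        System.out.println("application args = " + parsed.remaining);
    }
}
```

In this model, only code that reads the same conf object sees the
"tmpfiles" entry, while the application receives just the remaining
arguments.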


