Posted to hdfs-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2014/01/16 23:41:36 UTC

DistributedCache is empty

My driver is implemented around Tool, so it should be wrapping GenericOptionsParser internally.  Nevertheless, neither -files nor the DistributedCache methods seem to work.  Command-line usage is straightforward: I simply add "-files foo.py,bar.py" right after the class name (those files are in the current directory I run hadoop from, i.e., the local, non-HDFS filesystem).  The mapper then inspects the file list via DistributedCache.getLocalCacheFiles(context.getConfiguration()) and doesn't see the files; there's nothing there.  Likewise, if I try to run those Python scripts from the mapper using hadoop.util.Shell, the files unsurprisingly can't be found.
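
For concreteness, here is the shape of the mapper-side check (MyMapper, the key/value types, and the jar/class/path names in the comment are placeholders; sketched from memory):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // Invocation was along the lines of:
            //   hadoop jar myjob.jar com.example.MyDriver -files foo.py,bar.py in out
            // This should list the localized copies of the -files payload,
            // but for me it comes back null/empty.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cached == null || cached.length == 0) {
                System.err.println("DistributedCache is empty");
            } else {
                for (Path p : cached) {
                    System.err.println("cached file: " + p);
                }
            }
        }
    }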

That should have worked, so I shouldn't have to rely on the DistributedCache methods, but I tried them anyway: in the driver I create a new Configuration, call DistributedCache.addCacheFile(new URI("./foo.py"), conf) to reference the local non-HDFS file in the current working directory, and then pass conf to the Job constructor.  Seems straightforward, but still no dice: the mapper can't see the files; they simply aren't there.
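
Concretely, the relevant lines inside my run() method (sketched from memory; the job name is a placeholder):

    Configuration conf = new Configuration();                  // a brand-new Configuration
    DistributedCache.addCacheFile(new URI("./foo.py"), conf);  // local, relative, non-HDFS path
    Job job = new Job(conf, "distcache-test");                 // conf handed to the Job ctor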

What on Earth am I doing wrong here?

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: DistributedCache is empty

Posted by Keith Wiley <kw...@keithwiley.com>.
Hadoop 2.0.0.

The problem was that I was creating a new Configuration and handing it to the Job constructor (a pattern I believe is demonstrated in some tutorials), whereas the correct approach is to retrieve the preexisting Configuration via getConf() and use that instead.  This may be a distinction between writing a bare driver and one that extends Configured and implements Tool.
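
For anyone who finds this later, the driver shape that works for me now is roughly the following (MyDriver, MyMapper, and the job name are stand-ins; the getConf() line is the part that matters):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Use the Configuration that ToolRunner/GenericOptionsParser already
            // populated (that is where -files lands), not a fresh new Configuration().
            Configuration conf = getConf();
            Job job = new Job(conf, "distcache-test");
            job.setJarByClass(MyDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses the generic options (-files, -D, etc.) and strips
            // them before run() ever sees args.
            System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
        }
    }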

On Jan 17, 2014, at 09:46, Vinod Kumar Vavilapalli wrote:

> What is the version of Hadoop that you are using?
> 
> +Vinod
> 
> On Jan 16, 2014, at 2:41 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> 
>> My driver is implemented around Tool, so it should be wrapping GenericOptionsParser internally.  Nevertheless, neither -files nor the DistributedCache methods seem to work.  Command-line usage is straightforward: I simply add "-files foo.py,bar.py" right after the class name (those files are in the current directory I run hadoop from, i.e., the local, non-HDFS filesystem).  The mapper then inspects the file list via DistributedCache.getLocalCacheFiles(context.getConfiguration()) and doesn't see the files; there's nothing there.  Likewise, if I try to run those Python scripts from the mapper using hadoop.util.Shell, the files unsurprisingly can't be found.
>> 
>> That should have worked, so I shouldn't have to rely on the DistributedCache methods, but I tried them anyway: in the driver I create a new Configuration, call DistributedCache.addCacheFile(new URI("./foo.py"), conf) to reference the local non-HDFS file in the current working directory, and then pass conf to the Job constructor.  Seems straightforward, but still no dice: the mapper can't see the files; they simply aren't there.
>> 
>> What on Earth am I doing wrong here?
>> 
>> ________________________________________________________________________________
>> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
>> 
>> "Luminous beings are we, not this crude matter."
>>                                          --  Yoda
>> ________________________________________________________________________________
>> 


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________


Re: DistributedCache is empty

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
What is the version of Hadoop that you are using?

+Vinod

On Jan 16, 2014, at 2:41 PM, Keith Wiley <kw...@keithwiley.com> wrote:

> My driver is implemented around Tool, so it should be wrapping GenericOptionsParser internally.  Nevertheless, neither -files nor the DistributedCache methods seem to work.  Command-line usage is straightforward: I simply add "-files foo.py,bar.py" right after the class name (those files are in the current directory I run hadoop from, i.e., the local, non-HDFS filesystem).  The mapper then inspects the file list via DistributedCache.getLocalCacheFiles(context.getConfiguration()) and doesn't see the files; there's nothing there.  Likewise, if I try to run those Python scripts from the mapper using hadoop.util.Shell, the files unsurprisingly can't be found.
> 
> That should have worked, so I shouldn't have to rely on the DistributedCache methods, but I tried them anyway: in the driver I create a new Configuration, call DistributedCache.addCacheFile(new URI("./foo.py"), conf) to reference the local non-HDFS file in the current working directory, and then pass conf to the Job constructor.  Seems straightforward, but still no dice: the mapper can't see the files; they simply aren't there.
> 
> What on Earth am I doing wrong here?
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "Luminous beings are we, not this crude matter."
>                                           --  Yoda
> ________________________________________________________________________________
> 

