Posted to common-user@hadoop.apache.org by Stuart White <st...@gmail.com> on 2008/12/11 18:05:00 UTC

-libjars with multiple jars broken when client and cluster reside on different OSs?

I've written a simple map/reduce job that demonstrates a problem I'm
having.  Please see attached example.

Environment:
  hadoop 0.19.0
  cluster resides across linux nodes
  client resides on cygwin

To recreate the problem I'm seeing, do the following:

- Set up a hadoop cluster on linux

- Perform the remaining steps on cygwin, with a hadoop installation
configured to point to the linux cluster.  (set fs.default.name and
mapred.job.tracker)

- Extract the tarball.  Change into the created directory.
  tar xvfz Example.tar.gz
  cd Example

- Edit build.properties, set your hadoop.home appropriately, then
build the example.
  ant

- Load the file Example.in into your dfs
  hadoop dfs -copyFromLocal Example.in Example.in

- Execute the provided shell script, passing it testID 1.
  ./Example.sh 1
  This test does not use -libjars, and it completes successfully.

- Next, execute testID 2.
  ./Example.sh 2
  This test uses -libjars with 1 jarfile (Foo.jar), and it completes
successfully.

- Next, execute testID 3.
  ./Example.sh 3
  This test uses -libjars with 1 jarfile (Bar.jar), and it completes
successfully.

- Next, execute testID 4.
  ./Example.sh 4
  This test uses -libjars with 2 jarfiles (Foo.jar and Bar.jar), and
it fails with a ClassNotFoundException.

This behavior only occurs when calling from cygwin to linux or vice
versa.  If the cluster and the client both reside on linux, or both on
cygwin, the problem does not occur.
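
For context, -libjars is only honored when the job driver hands its
arguments to GenericOptionsParser, typically by running through
ToolRunner.  The attached Example isn't reproduced here, so the
following is only a minimal sketch of that kind of driver (the class
name, job name, and paths are illustrative assumptions, not the actual
contents of Example.tar.gz):

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  // Hypothetical driver; the real Example class in the tarball may differ.
  public class Example extends Configured implements Tool {

    public int run(String[] args) throws Exception {
      // getConf() carries whatever GenericOptionsParser set up,
      // including the jars named with -libjars.
      JobConf conf = new JobConf(getConf(), Example.class);
      conf.setJobName("libjars-example");

      // Mapper/reducer classes that live in Foo.jar / Bar.jar would be
      // configured here; identity map/reduce is used otherwise.
      FileInputFormat.setInputPaths(conf, new Path("Example.in"));
      FileOutputFormat.setOutputPath(conf, new Path("Example.out"));

      JobClient.runJob(conf);
      return 0;
    }

    public static void main(String[] args) throws Exception {
      // ToolRunner invokes GenericOptionsParser, which is what makes
      // -libjars (and -D, -fs, -jt) work at all.
      System.exit(ToolRunner.run(new Example(), args));
    }
  }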

I'm continuing to dig to see what I can figure out, but since I'm very
new to hadoop (started using it this week), I thought I'd go ahead and
throw this out there to see if anyone can help.

Thanks!

Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

Posted by Stuart White <st...@gmail.com>.
I agree.  Using a List<String> seems to make more sense.

FYI... I opened a jira for this:
https://issues.apache.org/jira/browse/HADOOP-4864

On Tue, Dec 30, 2008 at 3:53 PM, Jason Venner <ja...@attributor.com> wrote:

> The path separator is a major issue with a number of items in the
> configuration data set that are multiple items packed together via the path
> separator.
> the class path
> the distributed cache
> the input path set
>
> all suffer from the path.separator issue for 2 reasons:
> 1 being the difference across jvms as indicated in the previous email item
> (I had missed this!)
> 2 separator characters that happen to be embedded in the individual
> elements are not escaped before the item is added to the existing set.
>
> For all of the pain we have with these packed items, it may be simpler to
> serialize a List<String> for multi element items rather than packing them
> with the path.separator system property item.
>
>
>
> Aaron Kimball wrote:
>
>> Hi Stuart,
>>
>> Good sleuthing out that problem :) The correct way to submit patches is to
>> file a ticket on JIRA (https://issues.apache.org/jira/browse/HADOOP).
>> Create
>> an account, create a new issue describing the bug, and then attach the
>> patch
>> file. There'll be a discussion there and others can review your patch and
>> include it in the codebase.
>>
>> Cheers,
>> - Aaron
>>
>> On Fri, Dec 12, 2008 at 12:14 PM, Stuart White <stuart.white1@gmail.com> wrote:
>>
>>
>>
>>> Ok, I'll answer my own question.
>>>
>>> This is caused by the fact that hadoop uses
>>> system.getProperty("path.separator") as the delimiter in the list of
>>> jar files passed via -libjars.
>>>
>>> If your job spans platforms, system.getProperty("path.separator")
>>> returns a different delimiter on the different platforms.
>>>
>>> My solution is to use a comma as the delimiter, rather than the
>>> path.separator.
>>>
>>> I realize comma is, perhaps, a poor choice for a delimiter because it
>>> is valid in filenames on both Windows and Linux, but the -libjars uses
>>> it as the delimiter when listing the additional required jars.  So, I
>>> figured if it's already being used as a delimiter, then it's
>>> reasonable to use it internally as well.
>>>
>>> I've attached a patch (against 0.19.0) that applies this change.
>>>
>>> Now, with this change, I can submit hadoop jobs (requiring multiple
>>> supporting jars) from my Windows laptop (via cygwin) to my 10-node
>>> Linux hadoop cluster.
>>>
>>> Any chance this change could be applied to the hadoop codebase?
>>>
>>>
>>>
>>
>>
>>
>

Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

Posted by Jason Venner <ja...@attributor.com>.
The path separator is a major issue for a number of items in the 
configuration data set that are multiple items packed together via the 
path separator:
 - the class path
 - the distributed cache
 - the input path set

All of these suffer from the path.separator issue for two reasons:
 1. the separator differs across JVMs, as indicated in the previous 
email (I had missed this!)
 2. separator characters that happen to be embedded in the individual 
elements are not escaped before the element is added to the existing set.

For all of the pain we have with these packed items, it may be simpler 
to serialize a List<String> for multi-element items rather than packing 
them with the path.separator system property.
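
To make the failure mode concrete, here is a small standalone sketch
(not Hadoop's actual code; names are illustrative) of what happens when
one side packs with its path.separator and the other side splits with
its own:

  import java.util.Arrays;
  import java.util.List;

  public class PackedPathSketch {

      // Pack a list the way the packed config items are built: joined with
      // the local JVM's path.separator (";" on a Windows/cygwin JVM, ":" on
      // Linux).  Note there is no escaping of separators inside an element.
      static String pack(List<String> items, String separator) {
          StringBuilder sb = new StringBuilder();
          for (String item : items) {
              if (sb.length() > 0) {
                  sb.append(separator);
              }
              sb.append(item);
          }
          return sb.toString();
      }

      // Unpack on the other side using *that* JVM's path.separator.
      static String[] unpack(String packed, String separator) {
          return packed.split(java.util.regex.Pattern.quote(separator));
      }

      public static void main(String[] args) {
          List<String> jars = Arrays.asList("Foo.jar", "Bar.jar");

          // Client on cygwin packs with ";", cluster on linux splits on ":".
          String packedByCygwinClient = pack(jars, ";");
          String[] seenByLinuxCluster = unpack(packedByCygwinClient, ":");

          // Prints [Foo.jar;Bar.jar]: one bogus entry instead of two, which
          // surfaces as the ClassNotFoundException seen in test 4 above.
          System.out.println(Arrays.toString(seenByLinuxCluster));

          // Keeping the values as an explicit List<String> (or any
          // length-aware serialization) avoids both problems: no
          // OS-dependent delimiter and no escaping of embedded separators.
      }
  }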



Aaron Kimball wrote:
> Hi Stuart,
>
> Good sleuthing out that problem :) The correct way to submit patches is to
> file a ticket on JIRA (https://issues.apache.org/jira/browse/HADOOP). Create
> an account, create a new issue describing the bug, and then attach the patch
> file. There'll be a discussion there and others can review your patch and
> include it in the codebase.
>
> Cheers,
> - Aaron
>
> On Fri, Dec 12, 2008 at 12:14 PM, Stuart White <st...@gmail.com> wrote:
>
>   
>> Ok, I'll answer my own question.
>>
>> This is caused by the fact that hadoop uses
>> system.getProperty("path.separator") as the delimiter in the list of
>> jar files passed via -libjars.
>>
>> If your job spans platforms, system.getProperty("path.separator")
>> returns a different delimiter on the different platforms.
>>
>> My solution is to use a comma as the delimiter, rather than the
>> path.separator.
>>
>> I realize comma is, perhaps, a poor choice for a delimiter because it
>> is valid in filenames on both Windows and Linux, but the -libjars uses
>> it as the delimiter when listing the additional required jars.  So, I
>> figured if it's already being used as a delimiter, then it's
>> reasonable to use it internally as well.
>>
>> I've attached a patch (against 0.19.0) that applies this change.
>>
>> Now, with this change, I can submit hadoop jobs (requiring multiple
>> supporting jars) from my Windows laptop (via cygwin) to my 10-node
>> Linux hadoop cluster.
>>
>> Any chance this change could be applied to the hadoop codebase?
>>
>>     
>
>   

Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

Posted by Aaron Kimball <aa...@cloudera.com>.
Hi Stuart,

Good sleuthing out that problem :) The correct way to submit patches is to
file a ticket on JIRA (https://issues.apache.org/jira/browse/HADOOP). Create
an account, create a new issue describing the bug, and then attach the patch
file. There'll be a discussion there and others can review your patch and
include it in the codebase.

Cheers,
- Aaron

On Fri, Dec 12, 2008 at 12:14 PM, Stuart White <st...@gmail.com> wrote:

> Ok, I'll answer my own question.
>
> This is caused by the fact that hadoop uses
> system.getProperty("path.separator") as the delimiter in the list of
> jar files passed via -libjars.
>
> If your job spans platforms, system.getProperty("path.separator")
> returns a different delimiter on the different platforms.
>
> My solution is to use a comma as the delimiter, rather than the
> path.separator.
>
> I realize comma is, perhaps, a poor choice for a delimiter because it
> is valid in filenames on both Windows and Linux, but the -libjars uses
> it as the delimiter when listing the additional required jars.  So, I
> figured if it's already being used as a delimiter, then it's
> reasonable to use it internally as well.
>
> I've attached a patch (against 0.19.0) that applies this change.
>
> Now, with this change, I can submit hadoop jobs (requiring multiple
> supporting jars) from my Windows laptop (via cygwin) to my 10-node
> Linux hadoop cluster.
>
> Any chance this change could be applied to the hadoop codebase?
>

Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

Posted by Stuart White <st...@gmail.com>.
Ok, I'll answer my own question.

This is caused by the fact that hadoop uses
System.getProperty("path.separator") as the delimiter in the list of
jar files passed via -libjars.

If your job spans platforms, System.getProperty("path.separator")
returns a different delimiter on each platform (";" under cygwin, ":"
under linux).

My solution is to use a comma as the delimiter, rather than the path.separator.

I realize comma is, perhaps, a poor choice for a delimiter because it
is valid in filenames on both Windows and Linux, but -libjars already
uses it as the delimiter when listing the additional required jars.
So, I figured if it's already being used as a delimiter, then it's
reasonable to use it internally as well.

I've attached a patch (against 0.19.0) that applies this change.

Now, with this change, I can submit hadoop jobs (requiring multiple
supporting jars) from my Windows laptop (via cygwin) to my 10-node
Linux hadoop cluster.

Any chance this change could be applied to the hadoop codebase?
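
The attached patch isn't reproduced in this archive; roughly, the idea
is to change the character the jar list is split on, along these lines
(an illustrative sketch, not the literal diff, with made-up names):

  public class LibjarsDelimiterSketch {
      public static void main(String[] args) {
          // Suppose the client packed two -libjars entries with its own
          // path.separator (";" under cygwin), and the linux side later
          // splits the same string with *its* path.separator (":").
          String packedByCygwinClient = "Foo.jar;Bar.jar";

          String[] onLinuxBefore = packedByCygwinClient.split(":");
          // onLinuxBefore has length 1: the single entry "Foo.jar;Bar.jar"
          // can't be resolved on the classpath -> ClassNotFoundException.

          // With the patch's idea, both sides pack and split with a comma,
          // so the delimiter no longer depends on the OS.
          String packedWithComma = "Foo.jar,Bar.jar";
          String[] onLinuxAfter = packedWithComma.split(",");
          // onLinuxAfter has length 2: both jars are resolved.

          System.out.println(onLinuxBefore.length + " vs " + onLinuxAfter.length);
      }
  }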