Posted to common-user@hadoop.apache.org by "Kaluskar, Sanjay" <sk...@informatica.com> on 2010/08/11 09:48:28 UTC

Adding entries to classpath

I am using Hadoop indirectly through PIG, and some of the UDFs (defined
by me) need other jars at runtime (around 150), some of which have
conflicting resource names. Hence, unpacking all of them and repacking
them into a single jar doesn't work. My solution is to create a single
top-level jar that names all the dependencies in the Class-Path
attribute of its MANIFEST.MF. This is also simpler from a user's point
of view. Of course, this requires the top-level jar and all the
dependencies to be laid out in a directory structure that I can
control. Currently, I have a root directory that contains the top-level
jar and a directory called lib; all the dependencies are in lib, and
the top-level jar names them as lib/x.jar, lib/y.jar, etc. I package
all of this as a single zip file for easy installation.
 
Just to be clear, this is the dir structure:
 
root dir
    |
    |--- top-level.jar
    |--- lib
            |--- x.jar
            |--- y.jar
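
For illustration, the Class-Path entry in top-level.jar's
META-INF/MANIFEST.MF would look roughly like this (x.jar and y.jar are
just placeholders for the real dependency names):

Manifest-Version: 1.0
Class-Path: lib/x.jar lib/y.jar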
 
I can't register top-level.jar in my PIG script (this is the recommended
approach) because PIG then unpacks & repackages everything into a single
jar, instead of including the jar on the classpath. I can't use the
distributed cache because if I specify top-level.jar and lib separately
in mapred.cache.files, the relative directory locations aren't
preserved. If I use the mapred.cache.archives option and specify the zip
file, I can't add the top-level jar to the classpath (because the
entries in mapred.job.classpath.files must come from
mapred.cache.files).
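
For concreteness, the two attempts look roughly like this (the HDFS
paths are placeholders, not the real ones):

# attempt 1: individual files - the top-level.jar/lib layout is lost on the task nodes
-Dmapred.cache.files=hdfs:///deploy/top-level.jar,hdfs:///deploy/lib/x.jar,hdfs:///deploy/lib/y.jar

# attempt 2: one archive - but then the jar inside it can't be named in
# mapred.job.classpath.files, which only accepts entries from mapred.cache.files
-Dmapred.cache.archives=hdfs:///deploy/package.zip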
 
If mapred.child.java.opts also allowed java.class.path to be augmented
(similar to java.library.path, which I am using for native libs that I
store in another dir parallel to lib), that would have solved my
problem: I could have specified the zip in mapred.cache.archives and
added the jar to the classpath. Right now I can't see any solution other
than using a shared file system and adding top-level.jar to
HADOOP_CLASSPATH. This works because I am using a small cluster that has
a shared file system, but it's clearly not always feasible (and of
course, it modifies Hadoop's environment).
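
For reference, these are the two pieces I mean (paths are placeholders):

# native libs today, passed through the child JVM options
mapred.child.java.opts=-Djava.library.path=/shared/deploy/native

# the workaround: hadoop-env.sh on every node, via the shared file system
export HADOOP_CLASSPATH=/shared/deploy/top-level.jar:$HADOOP_CLASSPATH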
 
Please suggest any alternatives you can think of.
 
Thanks,
-sanjay

Re: Adding entries to classpath

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
On Thursday 12 August 2010 07:56 PM, Kaluskar, Sanjay wrote:
> Hi Mridul,
> [BTW thanks, I am glad to see some help on this mailing list - I have
> been burnt out by this problem!]
>
> I am not sure I understand the short-term solution. It seems like you
> are still suggesting overwriting some of the files; wouldn't that break
> some of the dependencies? Let me give you a very specific example.
> Suppose I write a UDF with 2 dependencies - a.jar and b.jar. Further
> suppose that both have a configuration file called types.xsd (stored as
> META-INF/resources/types.xsd in each jar), which is accessed at runtime
> (through this.getClass().getResource(), specifying the location of the
> file). Now, when I register both the jars, PIG will expand & re-package
> everything into a single jar, which means that one of the types.xsd
> files will be overwritten. This means that either a.jar or b.jar won't
> function as expected.

I was not thinking of META-INF dependencies, my mistake - you are right,
it will fail for that.
I was thinking only of class resolution, and typically, overwriting in
reverse order should be relatively fine (it is not a general solution;
there are corner cases where it will fail).


>
> That is the reason I am aiming for a solution that lets me specify all
> the dependencies on the classpath. These 150 dependencies are pretty
> much like 3rd-party software for me; I don't really understand them well
> enough or control them (and really, I shouldn't have to, or else it
> would get very hard to use any software).


In this case, using the URLClassLoader and reflection-based second
"solution" should probably work for you?
You should be careful to ensure that no references to the actual
business logic are made 'directly' - only through classes you create via
reflection.

Regards,
Mridul


RE: Adding entries to classpath

Posted by "Kaluskar, Sanjay" <sk...@informatica.com>.
Hi Mridul,
[BTW thanks, I am glad to see some help on this mailing list - I have
been burnt out by this problem!]

I am not sure I understand the short-term solution. It seems like you
are still suggesting overwriting some of the files; wouldn't that break
some of the dependencies? Let me give you a very specific example.
Suppose I write a UDF with 2 dependencies - a.jar and b.jar. Further
suppose that both have a configuration file called types.xsd (stored as
META-INF/resources/types.xsd in each jar), which is accessed at runtime
(through this.getClass().getResource(), specifying the location of the
file). Now, when I register both the jars, PIG will expand & re-package
everything into a single jar, which means that one of the types.xsd
files will be overwritten. This means that either a.jar or b.jar won't
function as expected.
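
(For reference, the lookup that breaks is roughly the following; the
resource path is the one mentioned above, everything else is just
illustrative:

URL u = this.getClass().getResource("/META-INF/resources/types.xsd");

After repackaging there is only one file at that path, so one of the two
jars sees the other's copy.)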

That is the reason I am aiming for a solution that lets me specify all
the dependencies on the classpath. These 150 dependencies are pretty
much like 3rd-party software for me; I don't really understand them well
enough or control them (and really, I shouldn't have to, or else it
would get very hard to use any software).

Right now, my workaround is fairly robust but ugly - I am adding the
top-level jar to HADOOP_CLASSPATH. That jar lists a.jar, b.jar, ... in
the Class-Path entry of its META-INF/MANIFEST.MF.

-sanjay


Re: Adding entries to classpath

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
A short-term alternative would be to find out the order in which pig
expands the jars, and ensure that your jars are expanded in reverse order.

As in, if you need your classpath to be "a.jar:b.jar:c.jar", and pig
un-jars the registered jars in the order they are specified in the
script, then simply register them in reverse order -

register c.jar;
register b.jar;
register a.jar;

(I am assuming an order of expansion here, and also that there IS an
order to begin with!).

This would be consistent with how Java loads classes for the most part
(unless you have tricky jar-level dependencies; I am ignoring that
possibility for now).

Worth a shot anyway while we wait for a pig/hadoop fix in the next release.



Another alternative might be to add all the dependencies into an archive,
'expand' this in an init block in your UDF, use a URLClassLoader to load
the jars, and use reflection to invoke your code: possibly I might be
missing something, but it looks workable ...
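
Something along these lines - completely untested, and the class, method
and directory names are only placeholders for whatever your archive
actually contains:

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public final class IsolatedInvoker {

    private static URLClassLoader loader;   // built once per task JVM

    // libDir is wherever your init block has already unpacked the archive.
    static synchronized ClassLoader loaderFor(File libDir) throws Exception {
        if (loader == null) {
            List<URL> urls = new ArrayList<URL>();
            File[] jars = libDir.listFiles();
            if (jars != null) {
                for (File jar : jars) {
                    if (jar.getName().endsWith(".jar")) {
                        urls.add(jar.toURI().toURL());
                    }
                }
            }
            loader = new URLClassLoader(urls.toArray(new URL[urls.size()]),
                    IsolatedInvoker.class.getClassLoader());
        }
        return loader;
    }

    // The UDF never references the business logic directly - it only names
    // the class as a string and invokes it through reflection.
    static Object invoke(File libDir, String className, String methodName,
                         String arg) throws Exception {
        Class<?> clazz = Class.forName(className, true, loaderFor(libDir));
        Object instance = clazz.newInstance();
        Method m = clazz.getMethod(methodName, String.class);
        return m.invoke(instance, arg);
    }
}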


Regards,
Mridul



RE: Adding entries to classpath

Posted by "Kaluskar, Sanjay" <sk...@informatica.com>.
Thanks Ashutosh, I will try that out.

Arun,
I had already explained why I can't register the 150 jars (very tedious,
error-prone, and PIG then unpacks & re-packs, which ends up overwriting
some of the resource files that have the same names). I also explained
why the dist cache doesn't work in this scenario (because specifying the
jars individually doesn't preserve the dir structure, and specifying the
zip file doesn't allow adding the jar to the classpath). I have been
trying this out for a few days using various options suggested in the
doc. Finally, I started reading the hadoop source code and discovered
why none of the solutions would work. If it were my choice, I would
actually fix mapred.child.java.opts to allow adding to the classpath,
because it is a generic solution and would be consistent with how
java.library.path is handled. I would also fix PIG to not try to mangle
all the registered jars - I have been burnt by that. I think PIG should
instead put all the registered jars on the classpath.

-sanjay 


Re: Adding entries to classpath

Posted by Ashutosh Chauhan <as...@gmail.com>.
Adding pig-user@

Sanjay,

You can do this in Pig by setting the following -D switch on the command
line with which you invoke Pig:
-Dpig.streaming.ship.files=myTopLevel.jar
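
so the full invocation would be something like this (the script name is
just a placeholder):

pig -Dpig.streaming.ship.files=myTopLevel.jar myscript.pig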

In the 0.8 release you will be able to do this from within a Pig script, like:
set pig.streaming.ship.files myTopLevel.jar;

Note that this is just to unblock you. It's an internal Pig property
that is not exposed to users and may break your script if you are also
using Streaming from within Pig. We need to find a long-term solution
for your particular use case.

Hope it helps,
Ashutosh


Re: Adding entries to classpath

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Moving to mapreduce-user@, bcc common-user@.

Why do you need to create a single top-level jar? Just register each
of your jars and put each in the distributed cache... however, you have
150 jars, which is a lot. Is there a way you can decrease that? I'm not
sure how you do this in pig, but in MR you have the ability to add a
jar in the DC to the classpath of the child
(DistributedCache.addFileToClassPath).
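
In raw MapReduce that is roughly the following at job-setup time (the
HDFS path is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// puts a jar that is already in HDFS on the distributed cache and on
// the task classpath
DistributedCache.addFileToClassPath(new Path("/deploy/lib/x.jar"), conf);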

Hope that helps.

Arun

