Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2009/07/24 18:42:25 UTC

Hive Class path libjars, auxjars, etc

I have been following some threads on the hadoop mailing list about
speeding up MR jobs. I have a few questions. I am sure I could find the
answers by digging into the source code, but I thought I could get a
quick answer here.

1. ADD JAR 'myfile.jar' uses the distributed cache. Using the
distributed cache has some overhead. I know that if I create an auxlibs
directory under the hive root, its jars will be added to libjars on
startup. If I add my jar to auxlibs on all my nodes, will a UDF in the
jar be available during subsequent jobs? Or is it only necessary to add
those jars to the auxlib on the node I start the job from?

2. Dealing with the entire hive install: how much of the hive install
really needs to be replicated on each datanode? If we used the
distributed cache for everything, the jobs would have unneeded
overhead, but hive would be 'installed on demand' from the client.

Thanks,
Edward

Re: Hive Class path libjars, auxjars, etc

Posted by Zheng Shao <zs...@gmail.com>.

I don't see a clear solution from that mailing thread: simply keeping
a TaskTrackerChild running longer won't solve the problem nicely
because tasks from different jobs should have different classpaths,
and I guess this is only supported in later versions of hadoop.

One simple way to go is to add the jars to hadoop-env.sh (which will
add those jars to the classpath of the TaskTracker). This is not a nice
solution, but it does give us all the performance gain no matter which
hadoop version we are using.

I think a better solution would be to add an option
"mapred.local.classpath" to JobConf, which specifies the paths of jars
already on the machines in the cluster. This should be done in hadoop
land, at the beginning of the main function in TaskTracker.Child (if
TaskTracker.Child is reused, then we need to reset the classpath each
time it runs a new task).
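
As a rough sketch of what resetting the classpath per task could look
like, assuming the proposed option carries a path-separated list of
jars that already exist on each node: the Java below builds a fresh
URLClassLoader from such a string. "mapred.local.classpath" is only the
name proposed above, not an existing Hadoop option, and the class name,
method names, and jar paths here are hypothetical. It does not touch
TaskTracker.Child itself.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

class LocalClasspathSketch {

    // Turn a path-separated list of local jar paths (the value the
    // hypothetical "mapred.local.classpath" option might carry) into URLs.
    // Blank entries are skipped; a null value yields an empty array.
    static URL[] toUrls(String localClasspath) throws MalformedURLException {
        List<URL> urls = new ArrayList<>();
        if (localClasspath == null) {
            return new URL[0];
        }
        for (String entry : localClasspath.split(File.pathSeparator)) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                urls.add(new File(trimmed).toURI().toURL());
            }
        }
        return urls.toArray(new URL[0]);
    }

    // Build a new class loader for one task, so that a reused child JVM
    // does not keep the previous job's jars on its classpath.
    static URLClassLoader loaderForTask(String localClasspath, ClassLoader parent)
            throws MalformedURLException {
        return new URLClassLoader(toUrls(localClasspath), parent);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical jar locations, assumed to exist on every node.
        String cp = "/opt/hive/auxlib/my-udfs.jar" + File.pathSeparator
                + "/opt/hive/lib/hive-exec.jar";
        ClassLoader previous = Thread.currentThread().getContextClassLoader();
        try (URLClassLoader loader =
                loaderForTask(cp, LocalClasspathSketch.class.getClassLoader())) {
            // The task would run with this loader as its context class loader.
            Thread.currentThread().setContextClassLoader(loader);
            System.out.println(loader.getURLs().length + " local jars for this task");
        } finally {
            // Restore the old loader before the JVM picks up the next task.
            Thread.currentThread().setContextClassLoader(previous);
        }
    }
}
```

Creating one loader per task, instead of appending to a shared
classpath, is what would keep tasks from different jobs isolated.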

What do you think?

Zheng

On Thu, Jul 30, 2009 at 11:54 AM, Edward Capriolo<ed...@gmail.com> wrote:
> Could we use something like this for a performance increase? Assuming
> that the jars are present on all task-trackers, could we have an
> alternate invocation script such as bin/hive-local?
>
> Edward
>



-- 
Yours,
Zheng

Re: Hive Class path libjars, auxjars, etc

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Jul 24, 2009 at 1:45 PM, Edward Capriolo<ed...@gmail.com> wrote:
> Zheng,
>
> A thread from the hadoop list piqued my interest. Search for
> "hadoop jobs take long time to setup":
>
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3C7e536b1f0906281408n1c2484bfve6dc1ea339110e9d@mail.gmail.com%3E
>
> Can hive benefit?
> Edward
>

Could we use something like this for a performance increase? Assuming
that the jars are present on all task-trackers, could we have an
alternate invocation script such as bin/hive-local?

Edward

Re: Hive Class path libjars, auxjars, etc

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Jul 24, 2009 at 1:36 PM, Zheng Shao<zs...@gmail.com> wrote:
> Hive only needs to be installed at the node that runs the hive query.
> All the jars will be sent to the hadoop JobClient via -libjars. The
> code is in ExecDriver.java.
>
> In hadoop 0.17, I don't think there is a way to add a path to the
> classpath for a job (unless we put it in hadoop-env.sh and start
> TaskTracker with that path). Are there any changes in the later
> versions?

Zheng,

A thread from the hadoop list piqued my interest. Search for
"hadoop jobs take long time to setup":

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3C7e536b1f0906281408n1c2484bfve6dc1ea339110e9d@mail.gmail.com%3E

Can hive benefit?
Edward

Re: Hive Class path libjars, auxjars, etc

Posted by Zheng Shao <zs...@gmail.com>.

Hive only needs to be installed at the node that runs the hive query.
All the jars will be sent to the hadoop JobClient via -libjars. The
code is in ExecDriver.java.
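
For illustration, the -libjars value is a comma-separated list of jar
paths. The Java sketch below builds such a list from the jars in one
directory; the class and method names are hypothetical, the default
directory name "auxlib" is an assumption, and this is not the code in
ExecDriver.java.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LibJarsSketch {

    // Join the .jar files found directly under dir into a comma-separated
    // string suitable as a -libjars value. Sorted so the output is stable.
    static String libJarsValue(File dir) {
        File[] jars = dir.listFiles((d, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return ""; // dir does not exist, is not a directory, or is unreadable
        }
        Arrays.sort(jars);
        List<String> paths = new ArrayList<>();
        for (File jar : jars) {
            paths.add(jar.getAbsolutePath());
        }
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        // Directory to scan; "auxlib" is an assumed default for this sketch.
        File auxDir = new File(args.length > 0 ? args[0] : "auxlib");
        System.out.println("-libjars " + libJarsValue(auxDir));
    }
}
```

Only the node that submits the job needs these files, since the jars
listed this way are shipped with the job.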

In hadoop 0.17, I don't think there is a way to add a path to the
classpath for a job (unless we put it in hadoop-env.sh and start
TaskTracker with that path). Are there any changes in the later
versions?



Zheng



On 7/24/09, Edward Capriolo <ed...@gmail.com> wrote:
> I have been following some threads on the hadoop mailing list about
> speeding up MR jobs. I have a few questions. I am sure I could find the
> answers by digging into the source code, but I thought I could get a
> quick answer here.
>
> 1. ADD JAR 'myfile.jar' uses the distributed cache. Using the
> distributed cache has some overhead. I know that if I create an auxlibs
> directory under the hive root, its jars will be added to libjars on
> startup. If I add my jar to auxlibs on all my nodes, will a UDF in the
> jar be available during subsequent jobs? Or is it only necessary to add
> those jars to the auxlib on the node I start the job from?
>
> 2. Dealing with the entire hive install: how much of the hive install
> really needs to be replicated on each datanode? If we used the
> distributed cache for everything, the jobs would have unneeded
> overhead, but hive would be 'installed on demand' from the client.
>
> Thanks,
> Edward
>

-- 
Yours,
Zheng