Posted to user@hive.apache.org by Ray Navarette <Ra...@pb.com> on 2018/02/08 14:28:17 UTC

Resources/Distributed Cache on Spark

Hello,

I'm hoping to find some information about using "ADD FILES <PATH>" with the Spark execution engine.  I've seen some JIRA tickets reference this functionality, but little else.  We have written some custom UDFs which require external resources.  With the MR execution engine, we can reference those resources by relative path and they are properly distributed and resolved.  When I try the same under the Spark engine, I receive an error saying the file is unavailable.

Does "ADD FILES <PATH>" work on spark, and if so, how should I properly reference those files in order to read them in the executors?

Thanks much for your help,
Ray
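
A minimal sketch of the pattern being described: a GenericUDF that reads a resource file from a directory named by a session property. All names here (MyCustomUDF, myCustomProperty, lookup.txt) are hypothetical illustrations, not the actual code from this thread.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyCustomUDF extends GenericUDF {

  private String resourceDir = "someSubDir/"; // overridden by the session property

  // configure() runs inside each task with the job configuration, so a
  // session-level "set myCustomProperty=..." is visible here (on MR at least).
  @Override
  public void configure(MapredContext context) {
    String dir = context.getJobConf().get("myCustomProperty");
    if (dir != null) {
      resourceDir = dir;
    }
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    // A relative resourceDir resolves against the task's working directory;
    // on MR, "add files" localizes resources there, which appears to be why
    // the relative path works on that engine.
    File f = new File(resourceDir, "lookup.txt");
    try {
      return new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new HiveException("Cannot read resource " + f.getAbsolutePath(), e);
    }
  }

  @Override
  public String getDisplayString(String[] children) {
    return "myCustomUDF(" + String.join(", ", children) + ")";
  }
}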

RE: Resources/Distributed Cache on Spark

Posted by Ray Navarette <Ra...@pb.com>.
Sorry for the resend, but does anyone know who I might best talk to about this?  Would it be worthwhile to bring this question to the dev list?

Thanks again for the help,
Ray

RE: Resources/Distributed Cache on Spark

Posted by Ray Navarette <Ra...@pb.com>.
Without using add files, we'd have to make sure these resources exist on every node, and would configure a Hive session like this:
set myCustomProperty=/path/to/directory/someSubDir/;
select myCustomUDF('param1','param2');

With the shared resources, we can do this instead, at least with the MR engine:
add files file:///path/to/directory;
set myCustomProperty=someSubDir/;
select myCustomUDF('param1','param2');

In both cases, the property myCustomProperty is read inside the custom UDF, interpreted as a path, and used to read the contents of a file within "someSubDir".  This works whenever we supply the full path, and with the relative path on the MR engine when using add files.  I'm wondering if I'm simply getting lucky: the MR engine downloads the files into the task's working directory, so the relative path resolves there, but perhaps Spark behaves differently?  I could supply a full path if I knew ahead of time where the file will land on the remote node, ideally via a property, something like ${hive.localResourceDir}/someSubDir.
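
A hedged sketch of the fallback this paragraph suggests: resolve against the task working directory first (the MR behavior), then against Spark's per-application file root. SparkFiles is Spark's real API, but whether Hive on Spark actually ships added resources through SparkContext#addFile is an assumption here, not something confirmed in this thread.

import java.io.File;

import org.apache.spark.SparkFiles;

// Hypothetical helper for a UDF like the one described above.
public final class ResourceResolver {
  private ResourceResolver() {}

  public static File resolve(String relativePath) {
    File local = new File(relativePath); // MR localizes added files into the task CWD
    if (local.exists()) {
      return local;
    }
    // If the resource was shipped via SparkContext#addFile (an assumption for
    // Hive on Spark), the distributed copy lives under Spark's per-app file root.
    return new File(SparkFiles.getRootDirectory(), relativePath);
  }
}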

Thanks for the quick response and your help with this.

Ray

Re: Resources/Distributed Cache on Spark

Posted by Sahil Takiar <ta...@gmail.com>.
It should work. We have tests such as groupby_bigdata.q that run on HoS (Hive on Spark) and work; they use the "add file" command. What are the exact commands you are running? What error are you seeing?
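
For reference, plain Spark's own file distribution behaves as sketched below; if HoS routes "add file" through the same mechanism (an assumption, not confirmed in this thread), the distributed copies would be reachable via SparkFiles. The file path is hypothetical.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public final class AddFileDemo {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("add-file-demo").setMaster("local[1]"));
    // Ship a local file to every executor, analogous to Hive's "add file".
    sc.addFile("file:///tmp/lookup.txt");
    // Anywhere in the application, the local copy is located by bare file name:
    String localCopy = SparkFiles.get("lookup.txt");
    System.out.println("Distributed copy at: " + localCopy);
    sc.stop();
  }
}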

-- 
Sahil Takiar
Software Engineer
takiar.sahil@gmail.com | (510) 673-0309