Posted to user@hive.apache.org by Stephen Boesch <ja...@gmail.com> on 2013/09/14 01:57:10 UTC

Options for Loading Side Data / small files in UDF

We have a UDF that is configured via a small properties file.  What are the
options for distributing that file to the task nodes?  We also want to be
able to update the file frequently.

We are not running on AWS, so S3 is not an option, and we do not have
access to NFS or other shared disk from the mappers.

If the Hive classes can access HDFS, that would likely be the most ideal
approach, and it seems it should be possible.  I am not clear how to do
that, since the standard HDFS API requires a Configuration to be supplied,
which is not available here.
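
For reference, a hedged sketch of the HDFS route: a default-constructed
Configuration typically picks up core-site.xml and hdfs-site.xml from the
task classpath, so the UDF does not need a Configuration handed to it. The
class and path below are hypothetical examples, not a confirmed solution:

```java
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SidePropsLoader {
    // Loads a small properties file from HDFS. A default Configuration
    // resolves fs.defaultFS from the *-site.xml files on the classpath,
    // so nothing needs to be passed in from the framework.
    public static Properties load(String hdfsPath) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Properties props = new Properties();
        try (InputStream in = fs.open(new Path(hdfsPath))) {
            props.load(in);
        }
        return props;
    }
}
```

Re-reading the file on each task attempt would also give the frequent-update
behavior, at the cost of one HDFS read per task.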

Pointers appreciated.

stephenb

Re: Options for Loading Side Data / small files in UDF

Posted by Stephen Boesch <ja...@gmail.com>.
Hi Jagat,

There is no call that loads a file from HDFS in Edward's example (which,
by the way, I had already seen).

I am looking into using getRequiredFiles()
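
If getRequiredFiles() pans out, the rough shape would be something like the
sketch below. This assumes the method is available on GenericUDF in the Hive
version in use; the class name, property key, and file path are hypothetical:

```java
import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

// Sketch only: assumes getRequiredFiles() exists on GenericUDF in the
// Hive version in use, and that the HDFS path below is replaced with a
// real one.
public class ConfiguredUDF extends GenericUDF {
    private Properties props;

    @Override
    public String[] getRequiredFiles() {
        // Hive ships the listed file to every task via the distributed cache.
        return new String[] { "hdfs:///app/udf.properties" };
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        if (props == null) {
            props = new Properties();
            // Cached files appear in the task working directory under
            // their base name.
            try (FileInputStream in = new FileInputStream("udf.properties")) {
                props.load(in);
            } catch (Exception e) {
                throw new HiveException(e);
            }
        }
        return new Text(props.getProperty("greeting", "default"));
    }

    @Override
    public String getDisplayString(String[] children) {
        return "configured_udf()";
    }
}
```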




Re: Options for Loading Side Data / small files in UDF

Posted by Jagat Singh <ja...@gmail.com>.
Sorry, I missed that.

Just check this example for accessing it from the API:

https://github.com/edwardcapriolo/hive-geoip/





Re: Options for Loading Side Data / small files in UDF

Posted by Stephen Boesch <ja...@gmail.com>.
I should have mentioned: we cannot use "add file" here because this is
running within a framework.  We need to use the Java APIs.



Re: Options for Loading Side Data / small files in UDF

Posted by Jagat Singh <ja...@gmail.com>.
Hi

You can use the distributed cache and Hive's ADD FILE command.

See here for example syntax:

http://stackoverflow.com/questions/15429040/add-multiple-files-to-distributed-cache-in-hive

Regards,

Jagat
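
Files shipped with ADD FILE land in each task's working directory under
their base name, so plain java.io suffices once the file is cached. A
minimal sketch (udf.properties and the threshold key are hypothetical; the
main method just simulates the cached file locally):

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Properties;

public class LoadSideFile {
    // After `ADD FILE /local/path/udf.properties;` the file is available in
    // the task's working directory under its base name, so a relative path
    // is enough.
    static Properties load(String baseName) throws Exception {
        Properties p = new Properties();
        try (FileReader r = new FileReader(baseName)) {
            p.load(r);
        }
        return p;
    }

    public static void main(String[] args) throws Exception {
        // Simulate the distributed-cache file locally for demonstration.
        try (FileWriter w = new FileWriter("udf.properties")) {
            w.write("threshold=42\n");
        }
        Properties p = load("udf.properties");
        System.out.println(p.getProperty("threshold"));
    }
}
```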

