Posted to user@pig.apache.org by Prashant Kommireddi <pr...@gmail.com> on 2012/12/11 10:10:29 UTC

Question regarding a custom LoadFunc implementation

I was working on a LoadFunc and needed some ideas/second opinion on the
best way to do this:


   1. We use an API to download data from a database as flat files.
      - A query is given with the table name and the fields required to
        extract data.
   2. Once step 1 is done, upload the data to HDFS.
   3. Upload the schema file to HDFS.
   4. A LoadFunc reads the schema file and parses the data.

A strict requirement is to hide the location of these HDFS files from the
user issuing the Pig query. For the user, it could look as simple as:

A = load 'scheme://SampleTable' using CustomLoader('$query');

Here the user only issues the load statement on a table with a query, and
the API calls for importing from the database happen in the background.

What would be the best way to do this? Is it better to do the above as part
of the LoadFunc, or would it be better to do it separately and somehow
communicate the location from the API import to the LoadFunc?

Thanks,

Prashant
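
A minimal sketch of step 4 in the list above, assuming a schema file that
holds one line of comma-separated name:type pairs and tab-delimited data
rows. Neither format is stated in the thread; the SchemaRowParser class and
its type names are hypothetical. In a real Pig loader this logic would sit
behind LoadFunc.getNext() and produce a Tuple rather than a Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: parses tab-delimited rows using a "name:type,..." schema line. */
final class SchemaRowParser {
    private final List<String> names = new ArrayList<>();
    private final List<String> types = new ArrayList<>();

    /** schemaLine is the assumed content of the schema file, e.g. "id:int, name:chararray". */
    SchemaRowParser(String schemaLine) {
        for (String field : schemaLine.split(",")) {
            String[] parts = field.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad schema field: " + field);
            }
            names.add(parts[0].trim());
            types.add(parts[1].trim());
        }
    }

    /** Converts one data row into field name -> typed value; empty fields become null. */
    Map<String, Object> parse(String row) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] values = row.split("\t", -1);
        if (values.length != names.size()) {
            throw new IllegalArgumentException(
                "Expected " + names.size() + " fields but got " + values.length);
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            out.put(names.get(i), convert(types.get(i), values[i]));
        }
        return out;
    }

    private static Object convert(String type, String raw) {
        if (raw.isEmpty()) {
            return null;
        }
        switch (type) {
            case "int":    return Integer.valueOf(raw);
            case "long":   return Long.valueOf(raw);
            case "double": return Double.valueOf(raw);
            default:       return raw; // treat anything else as a string
        }
    }
}
```

With the schema line "id:int, name:chararray, score:double", calling
parse("7\tAda\t") yields id as an Integer, name as a String, and a null
score.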

Re: Question regarding a custom LoadFunc implementation

Posted by Bill Graham <bi...@gmail.com>.
We had a YAML file that mapped physical data sources to the concrete loader
that the generic loader serves as a facade for. Now we're moving to an
HCatalog-based solution that handles that as well as the
logical-to-physical resolution. Basically, the mappings are stored in a DB.


On Tue, Dec 11, 2012 at 8:20 AM, Prashant Kommireddi <pr...@gmail.com> wrote:

> Thanks Bill. Any ideas on how to hide the location of HDFS files from the
> end user?


-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*
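
The logical-to-physical mapping described in this reply could be sketched
roughly as below. A java.util.Properties text stands in for the YAML file or
DB table, and the class name, the mapping format, and the /data/exports path
are all assumptions made for illustration. A loader might call something
like this from LoadFunc.relativeToAbsolutePath() or setLocation(), so the
user only ever writes 'scheme://SampleTable'.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

/** Hypothetical resolver: maps a logical 'scheme://Table' location to a physical path. */
final class TableLocationResolver {
    private final String scheme;
    private final Properties mappings = new Properties();

    /** mappingText holds lines such as "SampleTable=/data/exports/sample_table". */
    TableLocationResolver(String scheme, String mappingText) {
        this.scheme = scheme;
        try {
            mappings.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the physical path for a logical location, or fails if none is mapped. */
    String resolve(String logicalLocation) {
        URI uri = URI.create(logicalLocation);
        if (!scheme.equals(uri.getScheme())) {
            throw new IllegalArgumentException("Unsupported location: " + logicalLocation);
        }
        // In 'scheme://SampleTable' the table name is the URI authority.
        String table = uri.getAuthority();
        String path = (table == null) ? null : mappings.getProperty(table);
        if (path == null) {
            throw new IllegalStateException(
                "No physical location registered for " + logicalLocation);
        }
        return path;
    }
}
```

Because the user-facing location never contains the HDFS path, the mapping
can change without touching any Pig script.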

Re: Question regarding a custom LoadFunc implementation

Posted by Prashant Kommireddi <pr...@gmail.com>.
Thanks Bill. Any ideas on how to hide the location of HDFS files from the
end user?

On Tue, Dec 11, 2012 at 9:42 PM, Bill Graham <bi...@gmail.com> wrote:

> I think the latter would be better. Since the LoadFunc would be decoupled
> from the data exporter, you could schedule the exporting independently of
> the loading. We do something similar, without the $query part.

Re: Question regarding a custom LoadFunc implementation

Posted by Bill Graham <bi...@gmail.com>.
I think the latter would be better. Since the LoadFunc would be decoupled
from the data exporter, you could schedule the exporting independently of
the loading. We do something similar, without the $query part.
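
One way to picture the decoupling suggested here: the exporter publishes the
HDFS location to a shared registry once the upload finishes, and the loader
only reads from that registry. The sketch below is an illustration under
assumptions, not anyone's actual setup; ExportRegistry, ExportJob, the
in-memory map, and the /exports/<table>/<runId> layout are all hypothetical.
In practice the registry would be a file, a DB, or HCatalog.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical shared registry: the only contact point between exporter and loader. */
final class ExportRegistry {
    private final Map<String, String> latest = new ConcurrentHashMap<>();

    /** Called by the exporter after data and schema are fully uploaded. */
    void publish(String table, String hdfsPath) {
        latest.put(table, hdfsPath);
    }

    /** Called by the loader; empty if the table has never been exported. */
    Optional<String> latestLocation(String table) {
        return Optional.ofNullable(latest.get(table));
    }
}

/** Hypothetical scheduled export; runs independently of any Pig job. */
final class ExportJob {
    private final ExportRegistry registry;

    ExportJob(ExportRegistry registry) {
        this.registry = registry;
    }

    void run(String table, String runId) {
        String path = "/exports/" + table + "/" + runId; // assumed layout
        // The real API download and HDFS upload would happen here.
        registry.publish(table, path); // publish only once the upload is complete
    }
}
```

A loader would then call latestLocation("SampleTable") and fail with a clear
error if the Optional is empty, which keeps export scheduling and query
execution independent of each other.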




