
Posted to user@pig.apache.org by Gregory Harman <gr...@harmantechnologies.com> on 2009/02/10 05:27:48 UTC

Source from a database?

I've noticed a couple of old threads about implementing storage to a  
database, but I haven't seen anything in the mailing list archives or  
the online docs about sourcing data from a database - it's always  
loading it from a file. Is there some fundamental reason why you  
wouldn't want to pull data directly out of a database to process with  
Pig?

Of course you could dump the DB query results to a file first, but  
that seems pretty kludgy.

Would it be a legit approach to create a loader implementation that  
abstracts a database query, returning each row as if it were a line in  
a file? Will I get in trouble with this approach when I try to scale  
things up?

thanks,
Greg

Re: Source from a database?

Posted by Alan Gates <ga...@yahoo-inc.com>.

It is certainly possible to write a LoadFunc that also implements the  
Slicer interface and loads from a database.  See the HBase patch  
(https://issues.apache.org/jira/browse/PIG-6) for an example of  
something similar.

The issue you'll face is how to split the query in your Slicer.  How  
do you range partition a db query without having a significant amount  
of knowledge about the underlying database (such as how it's  
partitioned, how many concurrent queries it can sustain, what indices  
or keys you can use, etc.)?
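
To make the splitting problem concrete, here is a minimal sketch of one
possible approach. It assumes the table has a numeric key whose minimum
and maximum are known and whose values are spread fairly evenly. The
class name RangeSplitter, the method splitPredicates, and the column
name passed to it are hypothetical; none of them are part of Pig. The
sketch only produces the WHERE predicates that each slice would append
to its own query. It does not address how many concurrent queries the
database can handle, and a skewed key distribution would give uneven
slices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Pig: range-partitions a numeric key. */
public class RangeSplitter {

    /**
     * Splits the inclusive key range [minKey, maxKey] into at most
     * numSlices contiguous, non-overlapping SQL predicates.
     * Assumes maxKey - minKey + 1 does not overflow a long.
     */
    public static List<String> splitPredicates(
            String keyColumn, long minKey, long maxKey, int numSlices) {
        if (numSlices < 1 || maxKey < minKey) {
            throw new IllegalArgumentException("invalid range or slice count");
        }
        long total = maxKey - minKey + 1;
        // Never create more slices than there are key values.
        long slices = Math.min(numSlices, total);
        long base = total / slices;   // minimum number of keys per slice
        long extra = total % slices;  // the first 'extra' slices get one more key

        List<String> predicates = new ArrayList<>();
        long lo = minKey;
        for (long i = 0; i < slices; i++) {
            long size = base + (i < extra ? 1 : 0);
            long hi = lo + size - 1;
            predicates.add(keyColumn + " >= " + lo + " AND " + keyColumn + " <= " + hi);
            lo = hi + 1;
        }
        return predicates;
    }
}
```

Each slice would then run its own query with one predicate appended,
for example "SELECT ... WHERE " + predicate. Finding minKey and maxKey,
and picking a slice count the database can tolerate, is exactly the
database-specific knowledge the question above is about.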

Alan.
