Posted to user@pig.apache.org by Gregory Harman <gr...@harmantechnologies.com> on 2009/02/10 05:27:48 UTC
Source from a database?
I've noticed a couple of old threads about implementing storage to a
database, but I haven't seen anything in the mailing list archives or
the online docs about sourcing data from a database - it's always
loading it from a file. Is there some fundamental reason why you
wouldn't want to pull data directly out of a database to process with
Pig?

Of course you could dump the DB query results to a file first, but
that seems pretty kludgy.

Would it be a legit approach to create a loader implementation that
abstracts a database query, returning each row as if it were a line in
a file? Will I get in trouble with this approach when I try to scale
things up?

thanks,
Greg

Re: Source from a database?
Posted by Alan Gates <ga...@yahoo-inc.com>.
It is certainly possible to write a LoadFunc that also implements the
Slicer interface and loads from a database. See the HBase patch
(https://issues.apache.org/jira/browse/PIG-6) for an example of
something similar.

The issue you'll face is how to split the query in your Slicer. How
do you range-partition a database query without having a significant
amount of knowledge about the underlying database (such as how it's
partitioned, how many concurrent queries it can sustain, what indices
or keys you can use, etc.)?
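
As a rough illustration of the range-partitioning idea, here is a
minimal Java sketch that cuts a numeric key range into contiguous,
non-overlapping pieces and emits one query per piece, which is the
kind of work a Slicer would have to do. Everything in it is an
assumption made for illustration: the class QueryRangeSplitter, the
table "events", and the key column "id" are hypothetical, and nothing
here is part of Pig's API. It also assumes an indexed integer key
whose minimum and maximum are already known, and it splits the key
space evenly, which does not guarantee an even number of rows per
slice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Pig): one SQL query per key subrange. */
final class QueryRangeSplitter {

    /**
     * Splits the inclusive key range [minKey, maxKey] into at most
     * numSlices contiguous, non-overlapping subranges and returns one
     * SELECT statement per subrange.
     *
     * Assumptions: table and keyColumn are trusted identifiers (they are
     * concatenated into the SQL text), and maxKey - minKey + 1 fits in a long.
     */
    static List<String> split(String table, String keyColumn,
                              long minKey, long maxKey, int numSlices) {
        if (numSlices < 1 || maxKey < minKey) {
            throw new IllegalArgumentException(
                "need numSlices >= 1 and minKey <= maxKey");
        }
        long total = maxKey - minKey + 1;          // key values to cover
        long slices = Math.min(numSlices, total);  // never emit an empty slice
        long base = total / slices;                // minimum keys per slice
        long extra = total % slices;               // first `extra` slices get one more

        List<String> queries = new ArrayList<>();
        long lo = minKey;
        for (long i = 0; i < slices; i++) {
            long size = base + (i < extra ? 1 : 0);
            long hi = lo + size - 1;
            queries.add("SELECT * FROM " + table
                + " WHERE " + keyColumn + " >= " + lo
                + " AND " + keyColumn + " <= " + hi);
            lo = hi + 1;                           // next slice starts right after
        }
        return queries;
    }

    public static void main(String[] args) {
        // Keys 1..10 split three ways: 1..4, 5..7, 8..10.
        for (String q : split("events", "id", 1, 10, 3)) {
            System.out.println(q);
        }
    }
}
```

In practice the bounds would probably come from something like a
MIN/MAX query on the key, and the number of slices would have to be
capped by how many concurrent queries the database can sustain, which
is exactly the database-specific knowledge in question.
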
Alan.