Posted to user@pig.apache.org by William Oberman <ob...@civicscience.com> on 2012/09/11 17:17:14 UTC

best practice for Pig + MySql for meta data lookups

Hello,

My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
"relational/meta data".  Up until now that has been fine, but now I need to
start creating metrics that "cross the lines".  In particular, I need to
create aggregations of Cassandra data based on lookups from MySql.

After doing some research, it seems like my best option is to use something
like Sqoop to copy the meta/relational data I need from MySql -> HDFS, and
then use HDFS inside of Pig for the actual lookups.  I'd like to confirm
that this general strategy is correct (or hear any other tips).

Thanks!

will

Re: best practice for Pig + MySql for meta data lookups

Posted by William Oberman <ob...@civicscience.com>.
Hello again,

I spent the last day trying to load one of my MySql tables into a Pig bag
of tuples based on a sqoop-import.  I have something that finally seems to
work, but I wanted to double check there isn't a better/easier way, as
there was a *lot* of trial and error to get to this point.  Plus, if this
is correct and someone else struggles with this particular integration,
maybe they'll find this post someday.

Overall, my biggest problem was that I tried to test things using "ILLUSTRATE
bag_that_represents_table;".  If I had just done
"DUMP bag_that_represents_table;", I would have had success *much* sooner.
 But I hit at least one bug and a classloader issue that only exist (as
far as I can tell) because of ILLUSTRATE (in CDH3u5, which ships Pig 0.8.1).
 In particular:

1.) Doing the MySql -> HDFS import, I had to add two "non-standard"
parameters to sqoop-import:

--as-sequencefile => I had trouble figuring out how to load the default
(text) output into Pig.  Eventually I found this review:
https://review.cloudera.org/r/1670/
which references this code:
https://issues.cloudera.org/secure/attachment/10684/0001-Hive-SerDe-and-Pig-LoadFunc-to-access-Sqoop-sequence.patch
That loader code seems to work properly given an --as-sequencefile
sqoop-import.

--columns => I'm using Cloudera CDH3u5, so Pig 0.8.1, and I think there is
a bug in DisplayExamples.ShortenField(Object) that results in an NPE for
null values in ILLUSTRATE.  I changed my sqoop-import to only grab the
columns I wanted (which are all non-null).
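For reference, the final import looked something like the sketch below.  The
JDBC URL, credentials, column list, and target directory are placeholders,
not the actual values I used:

```shell
sqoop-import \
  --connect jdbc:mysql://dbhost/metadb \
  --username pig_user -P \
  --table TABLE_NAME \
  --columns "id,category" \
  --as-sequencefile \
  --target-dir /user/will/TABLE_NAME
```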

2.)  To get ILLUSTRATE to work, I had to add sqoop and my autogenerated
table jar (from the sqoop-import) to the *local* classpath.  In particular,
I had to add sqoop-1.3.0-cdh3u5.jar and TABLE_NAME.jar to /usr/lib/pig/lib

3.) To get DUMP to work, I had to "register XYZ.jar" for the two jars
referenced above in #2 to my pig script.

My problem is that I did #3 first, and it took me forever to figure out
that ILLUSTRATE wasn't respecting the register commands when building the
classpath to find sqoop and the table jar.  No idea *why* that is true, but
the "class not found" errors definitely went away as soon as I added the
jars to my local pig LIB_DIR.
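Putting #1-#3 together, the top of my Pig script ended up roughly like the
sketch below.  The loader class name here is a guess; check the patch linked
above for the real package and class, and adjust the schema to match your
--columns list:

```pig
-- needed for DUMP (the same jars must also sit in /usr/lib/pig/lib for ILLUSTRATE)
register /usr/lib/pig/lib/sqoop-1.3.0-cdh3u5.jar;
register /usr/lib/pig/lib/TABLE_NAME.jar;

-- load the sequence files written by sqoop-import; the loader name is illustrative
meta = LOAD '/user/will/TABLE_NAME'
       USING com.cloudera.sqoop.pig.SqoopRecordLoader()
       AS (id:chararray, category:chararray);
DUMP meta;
```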

I'm not 100% convinced I had to use sequence files, as my problems at the
time may have been related to the classloader issues.  But now that it's
working, I plan to move on rather than figure that out :-)

will



Re: best practice for Pig + MySql for meta data lookups

Posted by William Oberman <ob...@civicscience.com>.
Thanks (again)!

I'm already using CassandraStorage to load the JSON strings.  I used Maps
because I liked being able to name the fields, but I could easily change my
UDF (and my Pig script) to use tuples instead.  Maybe that's because I came
to Pig (and Hadoop) from the world of Cassandra rather than vice versa.

I'll look into Join and Cogroup more, and I'll see if I can puzzle through
how to load Sqoop persisted data into Pig.

will


Re: best practice for Pig + MySql for meta data lookups

Posted by Bill Graham <bi...@gmail.com>.
Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
For example, you could read from Cassandra using CassandraStorage [1]
and produce a Tuple of objects. If your data is JSON in Cassandra, you could
use a UDF to convert it to Tuples. Then you can join or cogroup
those tuples with others that you've imported from the DB.

1 - I've never used this:
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
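A minimal sketch of that flow (keyspace, column family, paths, and field
names are all made up, and JsonToTuple is a hypothetical UDF standing in for
the JSON-decoding step):

```pig
-- read raw rows from Cassandra, then decode the JSON payload into a tuple
raw  = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage();
rows = FOREACH raw GENERATE FLATTEN(JsonToTuple($1)) AS (x:chararray, val:long);

-- the Sqoop-imported relational table
meta = LOAD '/user/will/TABLE_NAME' AS (x:chararray, y:chararray);

-- cogroup keeps the two inputs in separate bags per key; JOIN would flatten them
grouped = COGROUP rows BY x, meta BY x;
```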

On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
<ob...@civicscience.com>wrote:

> Great news (for me)! :-)  My relational data is small (both relative to the
> big data, but also absolutely).
>
> I'm reading about Sqoop now, and it seems relatively straight forward.
>
> My current problem is not having done this kind of combining of data before
> in MR (which for me means Pig).  Right now I have to pipe my Cassandra data
> through a UDF, as the data itself is JSON (and I map it to a Map of well
> defined fields).  I was originally thinking I could just add a new field to
> my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
> even if I knew how to read HDFS, I don't know how to read data produced by
> Sqoop stored in HDFS).
>
> Now I'm wondering if this is the wrong mental model entirely.  I haven't
> figured out the details (obviously!), but it seems possible that using Pig
> itself (without resorting to UDFs) I could
> -load my Cassandra data
> -load my HDFS data
> -combine them
> But, I'm puzzling on the how for the 2nd and 3rd items.
>
> It's hard to get specific without getting *really* specific, but all of the
> new problems I have seem to boil down to something like:
> 1.) Inside Pig I have a Map that contains a field with value X
> 2.) I have meta data in MySql that maps that X to a more general grouping Y
> 3.) I want to create reporting data based on both X and Y
> The goal being to see how Y is doing overall, and how each X_i of Y are
> doing relative to each other....
>
> will
>
>
> On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <bi...@gmail.com>
> wrote:
>
> > That approach makes sense. We have similar situations where we pull
> > relation data into HDFS and then join/agg with it via MR. In other cases
> > we'll export aggregated HDFS data into a relational DB and then do
> > additional aggs using SQL. That option of course only works of your data
> > sizes are within reason.
> >
> >
> > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > <ob...@civicscience.com>wrote:
> >
> > > Hello,
> > >
> > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
> > > "relational/meta data".  Up until now that has been fine, but now I
> need
> > to
> > > start creating metrics that "cross the lines".  In particular, I need
> to
> > > create aggregations of Cassandra data based on lookups from MySql.
> > >
> > > After doing some research, it seems like my best option is using
> > something
> > > like Sqoop to map the meta/relational data I need from MySql -> HDFS,
> and
> > > then use HDFS inside of Pig for the actual lookups.  I'd like to
> confirm
> > > that general strategy is correct (or any other tips).
> > >
> > > Thanks!
> > >
> > > will
> > >
> >
> >
> >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me
> at
> > billgraham@gmail.com going forward.*
> >
>

Re: best practice for Pig + MySql for meta data lookups

Posted by William Oberman <ob...@civicscience.com>.
Great news (for me)! :-)  My relational data is small (both relative to the
big data, and absolutely).

I'm reading about Sqoop now, and it seems relatively straightforward.

My current problem is that I haven't done this kind of combining of data
before in MR (which for me means Pig).  Right now I have to pipe my Cassandra
data through a UDF, as the data itself is JSON (and I map it to a Map of
well-defined fields).  I was originally thinking I could just add a new field
to my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
even if I knew how to read HDFS, I don't know how to read data produced by
Sqoop stored in HDFS).

Now I'm wondering if this is the wrong mental model entirely.  I haven't
figured out the details (obviously!), but it seems possible that using Pig
itself (without resorting to UDFs) I could:
-load my Cassandra data
-load my HDFS data
-combine them
But I'm puzzling over how to do the 2nd and 3rd items.

It's hard to get specific without getting *really* specific, but all of the
new problems I have seem to boil down to something like:
1.) Inside Pig I have a Map that contains a field with value X
2.) I have meta data in MySql that maps that X to a more general grouping Y
3.) I want to create reporting data based on both X and Y
The goal being to see how Y is doing overall, and how each X_i of Y is
doing relative to the others....
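Without resorting to a lookup inside a UDF, steps 1-3 can be expressed as a
plain Pig join and group.  Relation, field, and UDF names below are
illustrative (JsonToTuple stands in for the JSON-decoding UDF):

```pig
-- facts: decoded Cassandra rows, with the lookup key X as a named field
facts = FOREACH cassandra_rows GENERATE FLATTEN(JsonToTuple($1)) AS (x:chararray, metric:long);

-- lookup: the Sqoop-imported MySql table mapping each X to its grouping Y
lookup = LOAD '/user/will/TABLE_NAME' AS (x:chararray, y:chararray);

-- attach Y to every fact, then aggregate per group
joined = JOIN facts BY x, lookup BY x;
by_y   = GROUP joined BY lookup::y;
report = FOREACH by_y GENERATE group AS y, SUM(joined.facts::metric) AS total;
```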

will



Re: best practice for Pig + MySql for meta data lookups

Posted by Bill Graham <bi...@gmail.com>.
That approach makes sense. We have similar situations where we pull
relational data into HDFS and then join/agg with it via MR. In other cases
we'll export aggregated HDFS data into a relational DB and then do
additional aggs using SQL. That option of course only works if your data
sizes are within reason.





-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*