Posted to user@pig.apache.org by Jeremy Hanna <je...@gmail.com> on 2011/06/15 19:35:42 UTC

useful little way to run locally with (pig|hive) && cassandra

We started doing this recently and thought it might be useful to others.

Pig (and Hive) have a sample operation that lets you take a random sample of the data in your data store.

In Pig it looks something like this:
mysample = SAMPLE myrelation 0.01;

One possible use for this, with Pig and Cassandra, is to solve the conundrum of testing locally.  We'd wondered how to do that, and settled on this: sample a column family (or set of CFs), store the sample into HDFS (or CFS), download it locally, then import it into your local Cassandra node.  That gives you real data to test against with Pig/Hive, or for other purposes.
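For what it's worth, the sample-and-store step might look roughly like this in Pig (the keyspace, CF, and output path are placeholders; CassandraStorage is the loadfunc that ships with Cassandra's Pig support):

    -- load the column family through Cassandra's Pig loadfunc
    rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage();

    -- keep roughly 1% of the rows
    sampled = SAMPLE rows 0.01;

    -- write the sample to HDFS so it can be pulled down locally
    STORE sampled INTO '/tmp/mycf_sample';

From there, something like "hadoop fs -get /tmp/mycf_sample ." pulls the sample down, and a small script can write it into the local node.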

That way, when you're flying out to the Hadoop Summit or the Cassandra SF event, you can play with real data :).

Maybe others have been doing this for years, but if not, we're finding it handy.

Jeremy

Re: useful little way to run locally with (pig|hive) && cassandra

Posted by Jeremy Hanna <je...@gmail.com>.
Cool - thanks Dmitriy!

On Jun 15, 2011, at 12:54 PM, Dmitriy Ryaboy wrote:

> Another tip:
> If you parametrize your load statements, it becomes easy to switch
> between loading from something like Cassandra, and reading from HDFS
> or local fs directly.
> 
> Also:
> Try using Pig's "illustrate" command when working through your flows
> -- it does some clever things that go far beyond simple random
> sampling of source data, in order to ensure that you can see the
> effects of doing filters, that joins get (possibly artificial)
> matching keys even if you sampled in a way that didn't actually
> produce any, etc.
> 
> D
> 
> On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
> <je...@gmail.com> wrote:
>> We started doing this recently and thought it might be useful to others.
>> 
>> Pig (and Hive) have a sample function that allows you to sample data from your data store.
>> 
>> In pig it looks something like this:
>> mysample = SAMPLE myrelation 0.01;
>> 
>> One possible use for this, with pig and cassandra is to solve a conundrum of testing locally.  We've wondered how to do this so we decided to do sampling of a column family (or set of CFs), store into HDFS (or CFS), download locally, then import into your local Cassandra node.  That gives you real data to test against with pig/hive or for other purposes.
>> 
>> That way, when you're flying out to the Hadoop Summit or the Cassandra SF event, you can play with real data :).
>> 
>> Maybe others have been doing this for years, but if not, we're finding it handy.
>> 
>> Jeremy


Re: useful little way to run locally with (pig|hive) && cassandra

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Another tip:
If you parametrize your load statements, it becomes easy to switch
between loading from something like Cassandra, and reading from HDFS
or local fs directly.
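A minimal sketch of that parametrization (SOURCE and LOADER are parameter names invented for illustration):

    -- myscript.pig: the source and loadfunc are parameters, so the same
    -- script can read from Cassandra on the cluster or a file locally
    rows = LOAD '$SOURCE' USING $LOADER();

    -- on the cluster, against Cassandra:
    --   pig -p SOURCE=cassandra://MyKeyspace/MyCF \
    --       -p LOADER=org.apache.cassandra.hadoop.pig.CassandraStorage myscript.pig
    -- locally, against the sampled data:
    --   pig -x local -p SOURCE=/tmp/mycf_sample -p LOADER=PigStorage myscript.pig

Pig substitutes parameters textually before parsing, so this works for the loadfunc name as well as the path.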

Also:
Try using Pig's "illustrate" command when working through your flows
-- it does some clever things that go well beyond simple random
sampling of the source data, to ensure that you can see the effects
of your filters, that joins get (possibly artificial) matching keys
even if your sample didn't happen to produce any, and so on.
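For example, from the grunt shell (the relation and field names here are made up):

    grunt> rows = LOAD '/tmp/mycf_sample' AS (key:chararray, hits:long);
    grunt> big = FILTER rows BY hits > 100;
    grunt> ILLUSTRATE big;

ILLUSTRATE prints a small example table for each relation in the flow, fabricating records where needed so every step has visible output.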

D

On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
<je...@gmail.com> wrote:
> We started doing this recently and thought it might be useful to others.
>
> Pig (and Hive) have a sample function that allows you to sample data from your data store.
>
> In pig it looks something like this:
> mysample = SAMPLE myrelation 0.01;
>
> One possible use for this, with pig and cassandra is to solve a conundrum of testing locally.  We've wondered how to do this so we decided to do sampling of a column family (or set of CFs), store into HDFS (or CFS), download locally, then import into your local Cassandra node.  That gives you real data to test against with pig/hive or for other purposes.
>
> That way, when you're flying out to the Hadoop Summit or the Cassandra SF event, you can play with real data :).
>
> Maybe others have been doing this for years, but if not, we're finding it handy.
>
> Jeremy