Posted to user@cassandra.apache.org by William Oberman <ob...@civicscience.com> on 2012/10/11 16:43:30 UTC

cassandra + pig

I'm wondering how many people are using cassandra + pig out there?  I
recently went through the effort of validating things at a much higher
level than I previously did (*), and found a few issues:
https://issues.apache.org/jira/browse/CASSANDRA-4748
https://issues.apache.org/jira/browse/CASSANDRA-4749
https://issues.apache.org/jira/browse/CASSANDRA-4789

In general, it seems like the widerow implementation still has rough edges.
I'm concerned that I don't understand why other people aren't using the
feature and thus hitting these problems.  Is everyone else just setting a
high static limit?  E.g. LOAD 'cassandra://KEYSPACE/CF?limit=X' where X >=
the maximum number of columns under any key?  Is everyone else using data
models where every key always has fewer than 1024 columns?  Do newer
versions of hadoop consume the cassandra API in a way that works around
these issues?  I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
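
For concreteness, the two LOAD styles I'm comparing look roughly like
this (a sketch only; KEYSPACE/CF and the limit value X are placeholders,
and I'm assuming the stock CassandraStorage loader):

  -- Style 1: static limit, hoping X covers the widest key
  rows = LOAD 'cassandra://KEYSPACE/CF?limit=X'
         USING org.apache.cassandra.hadoop.pig.CassandraStorage()
         AS (key, columns: bag {T: tuple(name, value)});

  -- Style 2: wide-row paging (new in 1.1.x), no static cap per key
  rows = LOAD 'cassandra://KEYSPACE/CF?widerows=true'
         USING org.apache.cassandra.hadoop.pig.CassandraStorage()
         AS (key, columns: bag {T: tuple(name, value)});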

(*) I took a random subsample of 50,000 keys of my production data (approx
1M total key/value pairs, some keys having only a single value and some
having thousands).  I then wrote both a pig script and a simple procedural
version of the same logic, and compared the results.  The outputs initially
differed, but after locally patching my code to fix the above 3 bugs
(though, really only two distinct issues), I now (finally) get the same
results.
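
If it helps anyone reproduce, the pig side of my cross-check was shaped
roughly like this (illustrative only, not my actual script; the schema
follows the usual CassandraStorage convention):

  -- Count the columns under each key, then diff these totals against
  -- the procedural version's output.
  rows   = LOAD 'cassandra://KEYSPACE/CF?widerows=true'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage()
           AS (key, columns: bag {T: tuple(name, value)});
  counts = FOREACH rows GENERATE key, COUNT(columns) AS n;
  STORE counts INTO 'pig_counts';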

Re: cassandra + pig

Posted by William Oberman <ob...@civicscience.com>.
Thanks Jeremy!  Maybe figuring out how to do paging in pig would have been
easier, but I found the widerow setting first, which led me to where I am
today.  I don't mind helping to blaze trails, or contributing back when
doing so, but I usually try to follow rather than lead when it comes to the
tools/software I choose to use.  I didn't realize how close to the edge I
was getting in this case :-)


Re: cassandra + pig

Posted by Jeremy Hanna <je...@gmail.com>.
For our use case, we had a lot of narrow column families, and for the couple of column families that had wide rows, we did our own paging through them.  I don't recall whether we did the paging in pig or in mapreduce, but you should be able to do it in both since pig allows you to specify the slice start.
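
Something shaped like the following, re-run with an updated slice_start each pass (a sketch, assuming the slice_start/slice_end/limit URL options of CassandraStorage; LAST_SEEN and the keyspace/CF names are placeholders):

  -- One page of columns per pass: resume the slice just past the last
  -- column name seen on the previous pass, then re-run the job.
  page = LOAD 'cassandra://KEYSPACE/CF?slice_start=LAST_SEEN&slice_end=&limit=1024'
         USING org.apache.cassandra.hadoop.pig.CassandraStorage()
         AS (key, columns: bag {T: tuple(name, value)});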

Re: cassandra + pig

Posted by William Oberman <ob...@civicscience.com>.
If you don't mind me asking, how are you handling the fact that pre-widerow
you only get a static number of columns per key (default 1024)?  Or am I
not understanding the "limit" concept?


Re: cassandra + pig

Posted by Jeremy Hanna <je...@gmail.com>.
The Dachis Group (where I just came from, now at DataStax) uses pig with cassandra for a lot of things.  However, we weren't using the widerow implementation yet since wide row support is new to 1.1.x and we were on 0.7, then 0.8, then 1.0.x.

I think since it's new to 1.1's hadoop support, it sounds like there are some rough edges, like you say.  But reproducible issues reported on tickets are much appreciated and will get addressed.
