Posted to user@cassandra.apache.org by Pete Warden <pe...@petewarden.com> on 2011/10/11 23:24:28 UTC

pig_cassandra problem - "Incompatible field schema" error

I'm trying to run the most basic example for pig_cassandra, counting the
number of rows in a column family, and I'm hitting the following error:

2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1031: Incompatable field schema: left is
"columns:bag{:tuple(name:bytearray,value:bytearray)}", right is
"columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"

I've tried it with various column families, with the same result, but here's
the definition of this one:

create column family FriendsAlreadyRanked with
  comparator = UTF8Type and
  column_metadata =
  [
    {column_name: time_last_ranked, validation_class: UTF8Type},
  ];

Here's the command I'm running from within pig_cassandra:

rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING CassandraStorage()
AS (key, columns:bag{T: tuple(name, value)});

Here are my versions:

Apache Pig version 0.9.1 (r1177456)

Cassandra 0.8.1

Any thoughts on how to troubleshoot this? It's obviously connecting to
Cassandra since it pulls out the column family definition, so I'm guessing
it's a Pig type definition problem, but I haven't figured out what it
expects (and all the examples just use the form above).

cheers,

           Pete

Re: pig_cassandra problem - "Incompatible field schema" error

Posted by Jeremy Hanna <je...@gmail.com>.
Just for informational purposes: Pete and I tried to troubleshoot this over Twitter. I was able to run the following successfully with Cassandra 0.8.1 and Pig 0.9.1. He's going to dig in to see if there's something else going on.

// Cassandra-cli stuff
// bin/cassandra-cli -h localhost -p 9160
create keyspace lala;
use lala;
create column family FriendsAlreadyRanked with
comparator = UTF8Type and
key_validation_class = UTF8Type and
column_metadata =
[
        {column_name: time_last_ranked, validation_class: UTF8Type},
];
set FriendsAlreadyRanked['mykey']['time_last_ranked'] = '2011-10-10';

// Pig stuff
// bin/pig_cassandra -x local myscript.pig
rows = LOAD 'cassandra://lala/FriendsAlreadyRanked' USING CassandraStorage() AS (key, columns:bag{T: tuple(name, value)});
dump rows;

// Output
(mykey,{(time_last_ranked,2011-10-10)})

On Oct 11, 2011, at 4:24 PM, Pete Warden wrote:

> I'm trying to run the most basic example for pig_cassandra, counting the number of rows in a column family, and I'm hitting the following error:
> 
> 2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable field schema: left is "columns:bag{:tuple(name:bytearray,value:bytearray)}", right is "columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"
> 
> I've tried it with various column families, with the same result, but here's the definition of this one:
> 
> create column family FriendsAlreadyRanked with
>   comparator = UTF8Type and
>   column_metadata =
>   [
>     {column_name: time_last_ranked, validation_class: UTF8Type},
>   ];
> 
> Here's the command I'm running from within pig_cassandra:
> rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING CassandraStorage() AS (key, columns:bag{T: tuple(name, value)});
> 
> Here's my versions:
> 
> Apache Pig version 0.9.1 (r1177456)
> 
> Cassandra 0.8.1
> 
> Any thoughts on how to troubleshoot this? It's obviously connecting to Cassandra since it pulls out the column family definition, so I'm guessing it's a Pig type definition problem, but I haven't figured out what it expects (and all the examples just use the form above).
> 
> cheers,
> 
>            Pete
> 


Re: pig_cassandra problem - "Incompatible field schema" error

Posted by Pete Warden <pe...@jetpac.com>.
JIRA filed, with a messy patch too:
https://issues.apache.org/jira/browse/CASSANDRA-3371

cheers,
           Pete

On Mon, Oct 17, 2011 at 2:27 AM, Pete Warden <pe...@jetpac.com> wrote:

> I've dug deeper into this, since this got my script running but still left
> me at sea when dealing with the actual data. It's looking like there may be
> a mismatch between the schema that's being reported by
> CassandraStorage.java, and the data that's actually returned. Here's an
> example:
>
> rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
> DESCRIBE rows;
> rows: {key: chararray,columns: {(name: chararray,value:
> bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid:
> chararray,value_pid: bytearray,matched_string:
> chararray,value_matched_string: bytearray,src_big: chararray,value_src_big:
> bytearray,time: chararray,value_time: bytearray,vote_type:
> chararray,value_vote_type: bytearray,voter: chararray,value_voter:
> bytearray)}}
> DUMP rows;
> (691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu
> Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})
>
> getSchema() is reporting the columns as an inner bag of tuples, each of
> which contains 16 values. In fact, getNext() seems to return an inner bag
> containing 7 tuples, each of which contains two values.
>
> I'll file a JIRA and do my best to create a patch to do the right thing,
> but I wanted to sanity check what I'm seeing here since I'm a Pig newbie. Am
> I missing something? It appears that things got out of sync with this
> change:
>
> http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083
>
> While I'm in there, is there a reason for using an inner bag to hold the
> columns? Do we ever have more than one set of the same columns for a given
> key in Cassandra? I'm thinking of tweaking things to look like this for my
> example, since it would make my processing code easier:
> rows: {cassandra_key: chararray, photo_owner: chararray, pid: chararray,
> matched_string: chararray, src_big: chararray, time: chararray,vote_type:
> chararray,voter: chararray}
> The main downside I can see is the possible clash between the cassandra_key
> value and a column with the same name.
>
> cheers,
>            Pete
>

Re: pig_cassandra problem - "Incompatible field schema" error

Posted by Pete Warden <pe...@jetpac.com>.
I've dug deeper into this, since this got my script running but still left
me at sea when dealing with the actual data. It's looking like there may be
a mismatch between the schema that's being reported by
CassandraStorage.java, and the data that's actually returned. Here's an
example:

rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
DESCRIBE rows;
rows: {key: chararray,columns: {(name: chararray,value: bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid: chararray,value_pid: bytearray,matched_string: chararray,value_matched_string: bytearray,src_big: chararray,value_src_big: bytearray,time: chararray,value_time: bytearray,vote_type: chararray,value_vote_type: bytearray,voter: chararray,value_voter: bytearray)}}
DUMP rows;
(691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})

getSchema() is reporting the columns as an inner bag of tuples, each of
which contains 16 values. In fact, getNext() seems to return an inner bag
containing 7 tuples, each of which contains two values.
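
To make the mismatch concrete, here is a minimal stand-alone Java sketch. It uses plain collections rather than the Pig or Cassandra classes, and the class and method names (SchemaShapeCheck, reportedTupleFields, returnedBag) are made up for illustration. It only compares the two shapes quoted above: the single wide tuple described by the schema, and the bag of (name, value) pairs seen in the DUMP output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaShapeCheck {
    // Field names of the one tuple schema reported by DESCRIBE: a generic
    // (name, value) pair followed by a (column, value_column) pair per column.
    public static List<String> reportedTupleFields(List<String> columnNames) {
        List<String> fields = new ArrayList<>(Arrays.asList("name", "value"));
        for (String column : columnNames) {
            fields.add(column);
            fields.add("value_" + column);
        }
        return fields;
    }

    // Shape of the data seen in DUMP: one two-field tuple per column.
    // The values are placeholders; only the shape matters here.
    public static List<List<String>> returnedBag(List<String> columnNames) {
        List<List<String>> bag = new ArrayList<>();
        for (String column : columnNames) {
            bag.add(Arrays.asList(column, "<value of " + column + ">"));
        }
        return bag;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("photo_owner", "pid", "matched_string",
                "src_big", "time", "vote_type", "voter");
        List<List<String>> bag = returnedBag(columns);
        System.out.println("fields per tuple according to the schema: "
                + reportedTupleFields(columns).size());
        System.out.println("tuples in the returned bag: " + bag.size());
        System.out.println("fields per returned tuple: " + bag.get(0).size());
    }
}
```

For the seven PhotoVotes columns this prints 16, 7 and 2, which are the numbers quoted above.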

I'll file a JIRA and do my best to create a patch to do the right thing, but
I wanted to sanity check what I'm seeing here since I'm a Pig newbie. Am I
missing something? It appears that things got out of sync with this change:
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083

While I'm in there, is there a reason for using an inner bag to hold the
columns? Do we ever have more than one set of the same columns for a given
key in Cassandra? I'm thinking of tweaking things to look like this for my
example, since it would make my processing code easier:
rows: {cassandra_key: chararray, photo_owner: chararray, pid: chararray,
matched_string: chararray, src_big: chararray, time: chararray,vote_type:
chararray,voter: chararray}
The main downside I can see is the possible clash between the cassandra_key
value and a column with the same name.
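
For what it's worth, the flat layout and the name clash can be sketched in a few lines of stand-alone Java. This is only an illustration with plain maps, not a change to CassandraStorage; FlatRow, flatten and KEY_ALIAS are hypothetical names, and the sample values are taken from the DUMP output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FlatRow {
    // Hypothetical alias for the row key in the flattened record.
    public static final String KEY_ALIAS = "cassandra_key";

    // Flattens a row key plus its columns into one ordered record.
    // Rejects a column whose name collides with a field already present,
    // which includes a column literally named like the key alias.
    public static Map<String, String> flatten(String rowKey, Map<String, String> columns) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put(KEY_ALIAS, rowKey);
        for (Map.Entry<String, String> column : columns.entrySet()) {
            if (record.containsKey(column.getKey())) {
                throw new IllegalArgumentException(
                        "column name clashes with an existing field: " + column.getKey());
            }
            record.put(column.getKey(), column.getValue());
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("photo_owner", "1596090180");
        columns.put("vote_type", "album_dislike");
        columns.put("voter", "691831038");
        System.out.println(flatten("691831038_1317937188.48955", columns));
    }
}
```

As far as I can tell, a layout like this only works when every column name is known up front from the column metadata; a column that isn't declared there would have no field to land in.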

cheers,
           Pete

On Tue, Oct 11, 2011 at 11:59 PM, Pete Warden <pe...@jetpac.com> wrote:

> For posterity, I ended up hacking around this by renaming the repeated
> 'value' alias in CassandraStorage and rebuilding it. Here's the patch:
>
> --- src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java.original 2011-10-11 23:42:19.000000000 -0700
> +++ src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java 2011-10-11 23:44:26.000000000 -0700
> @@ -357,7 +357,7 @@
>              validator = validators.get(cdef.getName());
>              if (validator == null)
>                  validator = marshallers.get(1);
> -            valSchema.setName("value");
> +            valSchema.setName("value_"+new String(cdef.getName()));
>              valSchema.setType(getPigType(validator));
>              tupleFields.add(valSchema);
>          }
>
> I'm not suggesting this is a correct fix, but it does allow me to move
> forward. Another suggestion was to try Pig 0.8.1 instead, but I ran into
> https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AWhatshallIdoifIsaw%22FailedtocreateDataStorage%22%3F
>

Re: pig_cassandra problem - "Incompatible field schema" error

Posted by Pete Warden <pe...@jetpac.com>.
For posterity, I ended up hacking around this by renaming the repeated
'value' alias in CassandraStorage and rebuilding it. Here's the patch:

--- src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java.original 2011-10-11 23:42:19.000000000 -0700
+++ src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java 2011-10-11 23:44:26.000000000 -0700
@@ -357,7 +357,7 @@
             validator = validators.get(cdef.getName());
             if (validator == null)
                 validator = marshallers.get(1);
-            valSchema.setName("value");
+            valSchema.setName("value_"+new String(cdef.getName()));
             valSchema.setType(getPigType(validator));
             tupleFields.add(valSchema);
         }

I'm not suggesting this is a correct fix, but it does allow me to move
forward. Another suggestion was to try Pig 0.8.1 instead, but I ran into
the "Failed to create DataStorage" problem described in the Pig FAQ:
https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AWhatshallIdoifIsaw%22FailedtocreateDataStorage%22%3F
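
Going back to the patch: the effect of the rename can be shown outside of Pig with a small stand-alone Java sketch. It is not code from CassandraStorage. ValueAliases, tupleAliases and duplicateAliases are made-up names, and it assumes the column names arrive as UTF-8 bytes (the column family uses comparator = UTF8Type), so it decodes them with an explicit charset rather than the platform default that new String(bytes) would use.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ValueAliases {
    // Builds the aliases of the inner tuple: a generic (name, value) pair,
    // then one (column, value alias) pair per column in the metadata.
    public static List<String> tupleAliases(List<byte[]> columnNames, boolean renameValues) {
        List<String> aliases = new ArrayList<>(Arrays.asList("name", "value"));
        for (byte[] rawName : columnNames) {
            // Assumption: column names are UTF-8 encoded.
            String column = new String(rawName, StandardCharsets.UTF_8);
            aliases.add(column);
            aliases.add(renameValues ? "value_" + column : "value");
        }
        return aliases;
    }

    // Returns every alias that occurs more than once, in first-seen order.
    public static Set<String> duplicateAliases(List<String> aliases) {
        Set<String> seen = new LinkedHashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String alias : aliases) {
            if (!seen.add(alias)) {
                duplicates.add(alias);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<byte[]> columns = new ArrayList<>();
        columns.add("time_last_ranked".getBytes(StandardCharsets.UTF_8));

        List<String> before = tupleAliases(columns, false);
        List<String> after = tupleAliases(columns, true);
        System.out.println(before + " duplicates: " + duplicateAliases(before));
        System.out.println(after + " duplicates: " + duplicateAliases(after));
    }
}
```

Without the rename, the aliases come out in the same order as in the error message (name, value, time_last_ranked, value), with "value" repeated. With the rename, the repeat goes away for this column family. A column that is itself called "name", "value" or "value_something" could presumably still collide, which is one more reason not to treat the patch as a proper fix.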

On Tue, Oct 11, 2011 at 10:34 PM, Pete Warden <pe...@jetpac.com> wrote:

> Thanks for all your help Brandon and Jeremy, that got me to the point where
> I could load data.
>
> I'm now hitting a new issue that seems like it could possibly be related.
> When I try to access the data like this:
>
> grunt> rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING
> CassandraStorage();
> grunt> parts = FOREACH rows GENERATE key,
> FromCassandraBag('time_last_ranked', columns);
>
> I see the following error:
>
> 2011-10-11 22:23:43,877 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1108:
> <line 4, column 71> Duplicate schema alias: value in "columns"
>
> At first I thought it might be related to the Pygmalion helper functions,
> so I tried to strip it back to basics using this second line instead:
>
> parts = FOREACH rows GENERATE key,$1;
>
> and I still get an identical error.
>
> Any further thoughts on how I can dig into this?
>
> Thanks again,
>                     Pete
>

Re: pig_cassandra problem - "Incompatible field schema" error

Posted by Pete Warden <pe...@jetpac.com>.
Thanks for all your help Brandon and Jeremy, that got me to the point where
I could load data.

I'm now hitting a new issue that seems like it could possibly be related.
When I try to access the data like this:

grunt> rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING CassandraStorage();
grunt> parts = FOREACH rows GENERATE key, FromCassandraBag('time_last_ranked', columns);

I see the following error:

2011-10-11 22:23:43,877 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1108:
<line 4, column 71> Duplicate schema alias: value in "columns"

At first I thought it might be related to the Pygmalion helper functions, so
I tried to strip it back to basics using this second line instead:

parts = FOREACH rows GENERATE key,$1;

and I still get an identical error.

Any further thoughts on how I can dig into this?

Thanks again,
                    Pete

On Tue, Oct 11, 2011 at 3:37 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Tue, Oct 11, 2011 at 4:24 PM, Pete Warden <pe...@petewarden.com> wrote:
> > I'm trying to run the most basic example for pig_cassandra, counting the
> > number of rows in a column family, and I'm hitting the following error:
> > 2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1031: Incompatable field schema: left is
> > "columns:bag{:tuple(name:bytearray,value:bytearray)}", right is
> >
> "columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"
>
> After https://issues.apache.org/jira/browse/CASSANDRA-2777 you need to
> remove the 'AS' and everything after it; your schema definition
> conflicts with what was inferred.
>
> -Brandon
>

Re: pig_cassandra problem - "Incompatible field schema" error

Posted by Brandon Williams <dr...@gmail.com>.
On Tue, Oct 11, 2011 at 4:24 PM, Pete Warden <pe...@petewarden.com> wrote:
> I'm trying to run the most basic example for pig_cassandra, counting the
> number of rows in a column family, and I'm hitting the following error:
> 2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1031: Incompatable field schema: left is
> "columns:bag{:tuple(name:bytearray,value:bytearray)}", right is
> "columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"

After https://issues.apache.org/jira/browse/CASSANDRA-2777 you need to
remove the 'AS' and everything after it; your schema definition
conflicts with what was inferred.

-Brandon