Posted to user@cassandra.apache.org by Chad Johnston <cj...@megatome.com> on 2013/08/23 17:32:33 UTC

CqlStorage creates wrong schema for Pig

(I'm using Cassandra 1.2.8 and Pig 0.11.1)

I'm loading some simple data from Cassandra into Pig using CqlStorage. The
CqlStorage loader defines a Pig schema based on the Cassandra schema, but
it seems to be wrong.

If I do:

data = LOAD 'cql://bookdata/books' USING CqlStorage();
DESCRIBE data;

I get this:

data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}

However, if I DUMP data, I get results like these:

((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))

Clearly the results from Cassandra are key/value pairs, as would be
expected. I don't know why the schema generated by CqlStorage() would be so
different.
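
To make the mismatch concrete, here is a small Python model of it. This is not Pig code and not how CqlStorage is implemented; plain Python tuples stand in for Pig tuples, and the names declared_schema, dumped_row and mismatched_fields are made up for illustration. The year is modelled as an int on the assumption that the value half of each pair carries the declared type.

```python
# Schema as reported by DESCRIBE: one scalar per column.
declared_schema = {
    "isbn": "chararray",
    "bookauthor": "chararray",
    "booktitle": "chararray",
    "publisher": "chararray",
    "yearofpublication": "int",
}

# Row as shown by DUMP: every field is a (name, value) pair.
dumped_row = (
    ("isbn", "0425093387"),
    ("bookauthor", "Georgette Heyer"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
    ("yearofpublication", 1986),
)

# Assumed mapping from Pig scalar types to Python types, for this model only.
PIG_TO_PYTHON = {"chararray": str, "int": int}


def mismatched_fields(row, schema):
    """Return the column names whose runtime field is not the declared scalar."""
    return [
        name
        for (name, pig_type), field in zip(schema.items(), row)
        if not isinstance(field, PIG_TO_PYTHON[pig_type])
    ]


# Keeping only the value half of each pair gives the row the schema describes.
values_only = tuple(value for _, value in dumped_row)

print(mismatched_fields(dumped_row, declared_schema))   # every column is off
print(mismatched_fields(values_only, declared_schema))  # nothing is off
```

In this model every column disagrees with the schema until the name half of each pair is dropped.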

This is really causing me problems trying to access the column values. I
tried a naive approach of FLATTENing each tuple, then trying to access the
values that way:

flattened = FOREACH data GENERATE
  FLATTEN(isbn),
  FLATTEN(booktitle),
  ...
values = FOREACH flattened GENERATE
  $1 AS ISBN,
  $3 AS BookTitle,
  ...

As soon as I try to access field $5, Pig complains about the index being
out of bounds.
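
A rough Python model of the FLATTEN attempt, using only the two fields named above. One possible reading, not confirmed anywhere in this thread, is that Pig checks positional references against the declared schema, which counts one scalar column per field, so a position can be rejected even though the runtime tuple is wider. flatten_pairs and the width variables are hypothetical names for this sketch.

```python
def flatten_pairs(row):
    """Concatenate (name, value) pairs into one flat tuple."""
    return tuple(item for pair in row for item in pair)


# The two fields that appear in the FLATTEN snippet.
row = (("isbn", "0425093387"), ("booktitle", "Death in the Stocks"))

flat = flatten_pairs(row)

# At runtime the names sit at even positions and the values at odd ones,
# which is why the snippet reaches for $1 and $3.
values = flat[1::2]

# The declared schema sees one scalar column per field, so it counts half
# as many positions as the runtime tuple really has.
declared_width = len(row)
runtime_width = len(flat)

print(flat)
print(values)
print(declared_width, runtime_width)
```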

Is there a way to solve the schema/reality mismatch? Am I doing something
wrong, or have I stumbled across a defect?

Thanks,
Chad

Re: CqlStorage creates wrong schema for Pig

Posted by Miguel Angel Martin junquera <mi...@gmail.com>.

Oops, sorry for my oversight.

I was checking the code and was surprised that it did not work with that Pig
script...

Now it works fine.

Many thanks, Chad.

Have a nice day.


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.martin@brainsins.com



2013/9/3 Chad Johnston <cj...@megatome.com>

> You're trying to use FromCqlColumn on a tuple that has been flattened. The
> schema still thinks it's {title: chararray}, but the flattened tuple is now
> two values. I don't know how to retrieve the data values in this case.
>
> Your code will work correctly if you do this:
> values3 = FOREACH rows GENERATE FromCqlColumn(title) AS title;
> dump values3;
> describe values3;
>
> (Use FromCqlColumn on the original data, not the flattened data.)
>
> Chad
>
>
> On Mon, Sep 2, 2013 at 8:45 AM, Miguel Angel Martin junquera <
> mianmarjun.mailinglist@gmail.com> wrote:
>
>> Hi,
>>
>> 1. Maybe like this?
>>
>> -- Register the UDF
>> REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT
>>
>> -- FromCqlColumn will convert chararray, int, long, float, double
>> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
>>
>> -- Load data as normal
>> data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
>>
>> -- Use the UDF
>> data = FOREACH data_raw GENERATE
>>     FromCqlColumn(isbn) AS ISBN,
>>     FromCqlColumn(bookauthor) AS BookAuthor,
>>     FromCqlColumn(booktitle) AS BookTitle,
>>     FromCqlColumn(publisher) AS Publisher,
>>     FromCqlColumn(yearofpublication) AS YearOfPublication;
>>
>> 2. With this data in Cassandra 1.2.8 (CQL3) and Pig 0.11.11:
>>
>> CREATE KEYSPACE keyspace1
>>   WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
>>   AND durable_writes = true;
>>
>> use keyspace1;
>>
>> CREATE TABLE test (
>>   id text PRIMARY KEY,
>>   title text,
>>   age int
>> ) WITH COMPACT STORAGE;
>>
>> insert into test (id, title, age) values('1', 'child', 21);
>> insert into test (id, title, age) values('2', 'support', 21);
>> insert into test (id, title, age) values('3', 'manager', 31);
>> insert into test (id, title, age) values('4', 'QA', 41);
>> insert into test (id, title, age) values('5', 'QA', 30);
>> insert into test (id, title, age) values('6', 'QA', 30);
>>
>>
>>
>>
>>
>> And this script:
>>
>> register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
>> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>> dump rows;
>> ILLUSTRATE rows;
>> describe rows;
>> A = FOREACH rows GENERATE FLATTEN(title);
>> dump A;
>> values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
>> dump values3;
>> describe values3;
>>
>>
>> I get this error:
>>
>> ....
>>
>> -------------------------------------------------------------
>> | rows     | id:chararray   | age:int   | title:chararray   |
>> -------------------------------------------------------------
>> |          | (id, 5)        | (age, 30) | (title, QA)       |
>> -------------------------------------------------------------
>>
>> rows: {id: chararray,age: int,title: chararray}
>>
>>
>> ...
>>
>> (title,QA)
>> (title,QA)
>> ..
>> 2013-09-02 16:40:52,454 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
>> java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
>>     at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 2013-09-02 16:40:52,832 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003
>>
>>
>>
>> 8-|
>>
>> Regards
>>
>> ...
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.martin@brainsins.com
>>
>>
>>
>> 2013/9/2 Miguel Angel Martin junquera <mi...@gmail.com>
>>
>>> Hi all,
>>>
>>> More info:
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-5941
>>>
>>> I tried this (and built Cassandra 1.2.9), but it does not work for me:
>>>
>>> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
>>> cd cassandra
>>> git checkout cassandra-1.2
>>> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
>>> ant
>>>
>>>
>>>
>>> Miguel Angel Martín Junquera
>>> Analyst Engineer.
>>> miguelangel.martin@brainsins.com
>>>
>>>
>>>
>>> 2013/9/2 Miguel Angel Martin junquera <mi...@gmail.com>
>>>
>>>> Good job!
>>>>
>>>> I had been testing with a UDF that only handles the string schema type;
>>>> this is better and more elaborate work.
>>>>
>>>> Regards
>>>>
>>>>
>>>> Miguel Angel Martín Junquera
>>>> Analyst Engineer.
>>>> miguelangel.martin@brainsins.com
>>>>
>>>>
>>>>
>>>> 2013/8/31 Chad Johnston <cj...@megatome.com>
>>>>
>>>>> I threw together a quick UDF to work around this issue. It just
>>>>> extracts the value portion of the tuple while taking advantage of the
>>>>> CqlStorage generated schema to keep the type correct.
>>>>>
>>>>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>>>>>
>>>>> I'll see if I can find more useful information and open a defect,
>>>>> since that's what this seems to be.
>>>>>
>>>>> Chad
>>>>>
>>>>>
>>>>> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
>>>>> mianmarjun.mailinglist@gmail.com> wrote:
>>>>>
>>>>>> I tried this:
>>>>>>
>>>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>>>>> dump rows;
>>>>>> ILLUSTRATE rows;
>>>>>> describe rows;
>>>>>>
>>>>>> values2 = FOREACH rows GENERATE TOTUPLE(id) AS (mycolumn:tuple(name,value));
>>>>>> dump values2;
>>>>>> describe values2;
>>>>>>
>>>>>> But I get these results:
>>>>>>
>>>>>>
>>>>>>
>>>>>> -------------------------------------------------------------
>>>>>> | rows     | id:chararray   | age:int   | title:chararray   |
>>>>>> -------------------------------------------------------------
>>>>>> |          | (id, 6)        | (age, 30) | (title, QA)       |
>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> rows: {id: chararray,age: int,title: chararray}
>>>>>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>>> - ERROR 1031: Incompatable field schema: left is
>>>>>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>>>>>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Or:
>>>>>>
>>>>>> ....
>>>>>>
>>>>>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>>>>>> dump values2;
>>>>>> describe values2;
>>>>>>
>>>>>> And the results are:
>>>>>>
>>>>>> ...
>>>>>> (((id,6)))
>>>>>> (((id,5)))
>>>>>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>>>>>
>>>>>> Aggg!!!!!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Miguel Angel Martín Junquera
>>>>>> Analyst Engineer.
>>>>>> miguelangel.martin@brainsins.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/8/26 Miguel Angel Martin junquera <
>>>>>> mianmarjun.mailinglist@gmail.com>
>>>>>>
>>>>>>> Hi Chad,
>>>>>>>
>>>>>>> I have this issue too.
>>>>>>>
>>>>>>> I sent a mail to the Pig user list, but I still cannot resolve this
>>>>>>> and cannot access the column values. In that mail I describe some
>>>>>>> things I tried without results, plus more information about this issue:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3CCAJeG_hQ9S2Po3_XytZX5Xki4J1maO8q26jYdG2Wndy_KYiv9CQ@mail.gmail.com%3E
>>>>>>>
>>>>>>> I hope someone replies with a comment, idea, or solution for this
>>>>>>> issue or bug.
>>>>>>>
>>>>>>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I
>>>>>>> have not configured an environment to debug and trace this issue.
>>>>>>>
>>>>>>> I only found comments like this one, which I do not fully understand:
>>>>>>>
>>>>>>> /**
>>>>>>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>>>>>>  *
>>>>>>>  * A row from a standard CF will be returned as nested tuples:
>>>>>>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>>>>>>  */
>>>>>>>
>>>>>>> If you find an idea or solution, please post it.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> 2013/8/23 Chad Johnston <cj...@megatome.com>
>>>>>>>
>>>>>>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>>>>>>
>>>>>>>> I'm loading some simple data from Cassandra into Pig using
>>>>>>>> CqlStorage. The CqlStorage loader defines a Pig schema based on the
>>>>>>>> Cassandra schema, but it seems to be wrong.
>>>>>>>>
>>>>>>>> If I do:
>>>>>>>>
>>>>>>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>>>>>>> DESCRIBE data;
>>>>>>>>
>>>>>>>> I get this:
>>>>>>>>
>>>>>>>> data: {isbn: chararray,bookauthor: chararray,booktitle:
>>>>>>>> chararray,publisher: chararray,yearofpublication: int}
>>>>>>>>
>>>>>>>> However, if I DUMP data, I get results like these:
>>>>>>>>
>>>>>>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in
>>>>>>>> the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>>>>>>
>>>>>>>> Clearly the results from Cassandra are key/value pairs, as would be
>>>>>>>> expected. I don't know why the schema generated by CqlStorage() would be so
>>>>>>>> different.
>>>>>>>>
>>>>>>>> This is really causing me problems trying to access the column
>>>>>>>> values. I tried a naive approach of FLATTENing each tuple, then trying to
>>>>>>>> access the values that way:
>>>>>>>>
>>>>>>>> flattened = FOREACH data GENERATE
>>>>>>>>   FLATTEN(isbn),
>>>>>>>>   FLATTEN(booktitle),
>>>>>>>>   ...
>>>>>>>> values = FOREACH flattened GENERATE
>>>>>>>>   $1 AS ISBN,
>>>>>>>>   $3 AS BookTitle,
>>>>>>>>   ...
>>>>>>>>
>>>>>>>> As soon as I try to access field $5, Pig complains about the index
>>>>>>>> being out of bounds.
>>>>>>>>
>>>>>>>> Is there a way to solve the schema/reality mismatch? Am I doing
>>>>>>>> something wrong, or have I stumbled across a defect?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Chad
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: CqlStorage creates wrong schema for Pig

Posted by Chad Johnston <cj...@megatome.com>.

You're trying to use FromCqlColumn on a tuple that has been flattened. The
schema still thinks it's {title: chararray}, but the flattened tuple is now
two values. I don't know how to retrieve the data values in this case.

Your code will work correctly if you do this:
values3 = FOREACH rows GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;

(Use FromCqlColumn on the original data, not the flattened data.)
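
A small Python model of the difference, for anyone who hits the same ClassCastException. This is a stand-in and not the UDF's actual source: from_cql_column is a hypothetical function that only mimics "take the value half of a (name, value) pair", and the rows are the two age=30 rows from the test table. Treating the flattened relation's title field as the first flattened item is an assumption of this sketch.

```python
def from_cql_column(field):
    """Return the value half of a (name, value) pair; reject anything else."""
    if not isinstance(field, tuple) or len(field) != 2:
        raise TypeError(
            f"expected a (name, value) tuple, got {type(field).__name__}"
        )
    return field[1]


# The two rows matching age = 30, shaped the way DUMP prints them.
rows = [
    (("id", "5"), ("age", 30), ("title", "QA")),
    (("id", "6"), ("age", 30), ("title", "QA")),
]

# On the original rows the title field is still a pair, so this works.
titles = [from_cql_column(row[2]) for row in rows]
print(titles)

# After flattening, the pair is spread over two positions. If the field
# still called "title" is the first of them, the function receives a string.
flattened = [row[2] for row in rows]
try:
    from_cql_column(flattened[0][0])
except TypeError as err:
    print("rejected:", err)
```

The second call fails for the same kind of reason as the stack trace: the function expects a tuple and is handed a plain string.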

Chad



Re: CqlStorage creates wrong schema for Pig

Posted by Miguel Angel Martin junquera <mi...@gmail.com>.

Hi,

1. Maybe like this?

-- Register the UDF
REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT

-- FromCqlColumn will convert chararray, int, long, float, double
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();

-- Load data as normal
data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();

-- Use the UDF
data = FOREACH data_raw GENERATE
    FromCqlColumn(isbn) AS ISBN,
    FromCqlColumn(bookauthor) AS BookAuthor,
    FromCqlColumn(booktitle) AS BookTitle,
    FromCqlColumn(publisher) AS Publisher,
    FromCqlColumn(yearofpublication) AS YearOfPublication;





2. With this data in Cassandra 1.2.8 (CQL3) and Pig 0.11.11:

CREATE KEYSPACE keyspace1
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
  AND durable_writes = true;

use keyspace1;

CREATE TABLE test (
  id text PRIMARY KEY,
  title text,
  age int
) WITH COMPACT STORAGE;

insert into test (id, title, age) values('1', 'child', 21);
insert into test (id, title, age) values('2', 'support', 21);
insert into test (id, title, age) values('3', 'manager', 31);
insert into test (id, title, age) values('4', 'QA', 41);
insert into test (id, title, age) values('5', 'QA', 30);
insert into test (id, title, age) values('6', 'QA', 30);





And this script:

register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
dump rows;
ILLUSTRATE rows;
describe rows;
A = FOREACH rows GENERATE FLATTEN(title);
dump A;
values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;
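
A side note on the LOAD location: where_clause=age%3D30 is the filter age=30 with the equals sign percent-encoded. Here is a small Python sketch for building such a location string. cql_location is a hypothetical helper, not part of Pig or Cassandra; the parameter names are only the ones used in the script, and which characters CqlStorage actually needs encoded is an assumption here.

```python
from urllib.parse import quote, urlencode


def cql_location(keyspace, table, **params):
    """Build a cql:// location string with percent-encoded query values."""
    base = f"cql://{keyspace}/{table}"
    # quote with safe="" encodes "=" inside a value as %3D.
    query = urlencode(params, quote_via=quote, safe="")
    return f"{base}?{query}" if query else base


location = cql_location(
    "keyspace1",
    "test",
    page_size=1,
    split_size=4,
    where_clause="age=30",
)
print(location)
```

Generating the string this way avoids hand-encoding mistakes when the filter changes.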


I get this error:

....

-------------------------------------------------------------
| rows     | id:chararray   | age:int   | title:chararray   |
-------------------------------------------------------------
|          | (id, 5)        | (age, 30) | (title, QA)       |
-------------------------------------------------------------

rows: {id: chararray,age: int,title: chararray}


...

(title,QA)
(title,QA)
..
2013-09-02 16:40:52,454 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-09-02 16:40:52,832 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003



8-|

Regards

...


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.martin@brainsins.com



>>>>> 2013/8/23 Chad Johnston <cj...@megatome.com>
>>>>>
>>>>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>>>>
>>>>>> I'm loading some simple data from Cassandra into Pig using
>>>>>> CqlStorage. The CqlStorage loader defines a Pig schema based on the
>>>>>> Cassandra schema, but it seems to be wrong.
>>>>>>
>>>>>> If I do:
>>>>>>
>>>>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>>>>> DESCRIBE data;
>>>>>>
>>>>>> I get this:
>>>>>>
>>>>>> data: {isbn: chararray,bookauthor: chararray,booktitle:
>>>>>> chararray,publisher: chararray,yearofpublication: int}
>>>>>>
>>>>>> However, if I DUMP data, I get results like these:
>>>>>>
>>>>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in
>>>>>> the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>>>>
>>>>>> Clearly the results from Cassandra are key/value pairs, as would be
>>>>>> expected. I don't know why the schema generated by CqlStorage() would be so
>>>>>> different.
>>>>>>
>>>>>> This is really causing me problems trying to access the column
>>>>>> values. I tried a naive approach of FLATTENing each tuple, then trying to
>>>>>> access the values that way:
>>>>>>
>>>>>> flattened = FOREACH data GENERATE
>>>>>>   FLATTEN(isbn),
>>>>>>   FLATTEN(booktitle),
>>>>>>   ...
>>>>>> values = FOREACH flattened GENERATE
>>>>>>   $1 AS ISBN,
>>>>>>   $3 AS BookTitle,
>>>>>>   ...
>>>>>>
>>>>>> As soon as I try to access field $5, Pig complains about the index
>>>>>> being out of bounds.
>>>>>>
>>>>>> Is there a way to solve the schema/reality mismatch? Am I doing
>>>>>> something wrong, or have I stumbled across a defect?
>>>>>>
>>>>>> Thanks,
>>>>>> Chad
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: CqlStorage creates wrong schema for Pig

Posted by Miguel Angel Martin junquera <mi...@gmail.com>.

Hi all,

More info:

https://issues.apache.org/jira/browse/CASSANDRA-5941

I tried the following (building Cassandra 1.2.9 from the cassandra-1.2
branch), but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
ant



Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.martin@brainsins.com



2013/9/2 Miguel Angel Martin junquera <mi...@gmail.com>

> Good job, nice work!
>
> I had been testing with a UDF that handles only the string schema type;
> yours is better and more elaborate work.
>
> Regards,
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.martin@brainsins.com
>
>
>

Re: CqlStorage creates wrong schema for Pig

Posted by Miguel Angel Martin junquera <mi...@gmail.com>.

Good job, nice work!

I had been testing with a UDF that handles only the string schema type;
yours is better and more elaborate work.

Regards,


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.martin@brainsins.com



2013/8/31 Chad Johnston <cj...@megatome.com>

> I threw together a quick UDF to work around this issue. It just extracts
> the value portion of the tuple while taking advantage of the CqlStorage
> generated schema to keep the type correct.
>
> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>
> I'll see if I can find more useful information and open a defect, since
> that's what this seems to be.
>
> Chad

Re: CqlStorage creates wrong schema for Pig

Posted by Chad Johnston <cj...@megatome.com>.

I threw together a quick UDF to work around this issue. It just extracts
the value portion of the tuple while taking advantage of the CqlStorage
generated schema to keep the type correct.

You can get it here: https://github.com/iamthechad/cqlstorage-udf
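
The idea can be modelled outside Pig. The sketch below is not the code from
the repository above; it uses plain Python tuples in place of Pig tuples, and
the names extract_value and declared_types are hypothetical. It assumes each
column arrives as a (name, value) pair, as in the DUMP output from the
original post, and that the types reported by DESCRIBE describe the value
half of each pair.

```python
# Illustrative model only: plain Python tuples stand in for Pig tuples.
# The row copies the DUMP output from the original post in this thread.
row = (
    ("isbn", "0425093387"),
    ("bookauthor", "Georgette Heyer"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
    ("yearofpublication", 1986),
)

# Types as reported by DESCRIBE (chararray -> str, int -> int).
declared_types = {
    "isbn": str,
    "bookauthor": str,
    "booktitle": str,
    "publisher": str,
    "yearofpublication": int,
}


def extract_value(column, declared_types):
    """Return the value half of a (name, value) pair, cast to its declared type."""
    name, value = column
    return declared_types[name](value)


# One plain value per column, in the types the schema promised.
values = tuple(extract_value(column, declared_types) for column in row)
print(values)
```

The point of the model is only that the declared schema is still useful: it
describes the value inside each pair rather than the pair itself.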

I'll see if I can find more useful information and open a defect, since
that's what this seems to be.

Chad


On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
mianmarjun.mailinglist@gmail.com> wrote:

> I tried this:
>
> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30'
>        USING CqlStorage();
>
> dump rows;
>
> ILLUSTRATE rows;
>
> describe rows;
>
> values2 = FOREACH rows GENERATE TOTUPLE(id) AS
>           (mycolumn:tuple(name,value));
>
> dump values2;
>
> describe values2;
>
> But I get these results:
>
> -------------------------------------------------------------
> | rows     | id:chararray   | age:int   | title:chararray   |
> -------------------------------------------------------------
> |          | (id, 6)        | (age, 30) | (title, QA)       |
> -------------------------------------------------------------
>
> rows: {id: chararray,age: int,title: chararray}
> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1031: Incompatable field schema: left is
> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>
> Or, with this instead:
>
> ....
>
> values2 = FOREACH rows GENERATE TOTUPLE(id);
> dump values2;
> describe values2;
>
> the results are:
>
> ...
> (((id,6)))
> (((id,5)))
> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>
> Aggg!!!!!
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.martin@brainsins.com
Re: CqlStorage creates wrong schema for Pig

Posted by Miguel Angel Martin junquera <mi...@gmail.com>.

I tried this:

rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30'
       USING CqlStorage();

dump rows;

ILLUSTRATE rows;

describe rows;

values2 = FOREACH rows GENERATE TOTUPLE(id) AS
          (mycolumn:tuple(name,value));

dump values2;

describe values2;

But I get these results:

-------------------------------------------------------------
| rows     | id:chararray   | age:int   | title:chararray   |
-------------------------------------------------------------
|          | (id, 6)        | (age, 30) | (title, QA)       |
-------------------------------------------------------------

rows: {id: chararray,age: int,title: chararray}
2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1031: Incompatable field schema: left is
"tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
"org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"

Or, with this instead:

....

values2 = FOREACH rows GENERATE TOTUPLE(id);
dump values2;
describe values2;

the results are:

...
(((id,6)))
(((id,5)))
values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
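
One way to read the triple parentheses above, sketched in Python rather than
Pig: if every field of a loaded row is already a (name, value) pair, as the
ILLUSTRATE output suggests, then wrapping a field with TOTUPLE adds one more
level of nesting instead of exposing the value. This is only a model under
that assumption; to_tuple, foreach_generate and nesting_depth are
hypothetical helpers, not Pig or Cassandra APIs.

```python
# Illustrative model only: Python tuples stand in for Pig tuples.
# The row copies the ILLUSTRATE output: every field is a (name, value) pair.
row = (("id", "6"), ("age", 30), ("title", "QA"))


def to_tuple(*fields):
    """Model of TOTUPLE: wrap the given fields in a new tuple."""
    return tuple(fields)


def foreach_generate(row, *expressions):
    """Model of FOREACH ... GENERATE: each expression yields one output field."""
    return tuple(expression(row) for expression in expressions)


def nesting_depth(value):
    """Count how many tuple levels enclose the innermost scalar."""
    if not isinstance(value, tuple):
        return 0
    return 1 + max((nesting_depth(item) for item in value), default=0)


def wrapped_id(r):
    """Model of the expression TOTUPLE(id): wrap the first field of the row."""
    return to_tuple(r[0])


out = foreach_generate(row, wrapped_id)
print(nesting_depth(row), nesting_depth(out))
```

In this model the loaded row is two levels deep and the TOTUPLE result is
three, which matches the shape of (((id,6))) in the dump.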



Aggg!!!!!

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.martin@brainsins.com



2013/8/26 Miguel Angel Martin junquera <mi...@gmail.com>

> Hi Chad,
>
> I have this issue too.
>
> I sent a mail to the Pig user list, but I still cannot resolve it, and I
> cannot access the column values. In that mail I describe some things I
> tried without success, along with more information about the issue:
>
> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3CCAJeG_hQ9S2Po3_XytZX5Xki4J1maO8q26jYdG2Wndy_KYiv9CQ@mail.gmail.com%3E
>
> I hope someone replies with a comment, idea, or solution for this issue or
> bug.
>
> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I do
> not have an environment configured to debug and trace this issue.
>
> I only found some comments like the following, which I do not fully
> understand:
>
> /**
>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>  *
>  * A row from a standard CF will be returned as nested tuples:
>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>  */
>
> If you find an idea or solution, please post it.
>
> Thanks
Re: CqlStorage creates wrong schema for Pig

Posted by Miguel Angel Martin junquera <mi...@gmail.com>.

Hi Chad,

I have this issue too.

I sent a mail to the Pig user list, but I still cannot resolve it, and I
cannot access the column values. In that mail I describe some things I
tried without success, along with more information about the issue:

http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3CCAJeG_hQ9S2Po3_XytZX5Xki4J1maO8q26jYdG2Wndy_KYiv9CQ@mail.gmail.com%3E

I hope someone replies with a comment, idea, or solution for this issue or
bug.

I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I do
not have an environment configured to debug and trace this issue.

I only found some comments like the following, which I do not fully
understand:

/**
 * A LoadStoreFunc for retrieving data from and storing data to Cassandra
 *
 * A row from a standard CF will be returned as nested tuples:
 * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
 */

If you find an idea or solution, please post it.

Thanks

2013/8/23 Chad Johnston <cj...@megatome.com>

> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>
> I'm loading some simple data from Cassandra into Pig using CqlStorage. The
> CqlStorage loader defines a Pig schema based on the Cassandra schema, but
> it seems to be wrong.
>
> If I do:
>
> data = LOAD 'cql://bookdata/books' USING CqlStorage();
> DESCRIBE data;
>
> I get this:
>
> data: {isbn: chararray,bookauthor: chararray,booktitle:
> chararray,publisher: chararray,yearofpublication: int}
>
> However, if I DUMP data, I get results like these:
>
> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the
> Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>
> Clearly the results from Cassandra are key/value pairs, as would be
> expected. I don't know why the schema generated by CqlStorage() would be so
> different.
>
> This is really causing me problems trying to access the column values. I
> tried a naive approach of FLATTENing each tuple, then trying to access the
> values that way:
>
> flattened = FOREACH data GENERATE
>   FLATTEN(isbn),
>   FLATTEN(booktitle),
>   ...
> values = FOREACH flattened GENERATE
>   $1 AS ISBN,
>   $3 AS BookTitle,
>   ...
>
> As soon as I try to access field $5, Pig complains about the index being
> out of bounds.
>
> Is there a way to solve the schema/reality mismatch? Am I doing something
> wrong, or have I stumbled across a defect?
>
> Thanks,
> Chad
>