You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Ondřej Černoš <ce...@gmail.com> on 2013/11/19 16:52:47 UTC
Cassandra and Pig - CQL maps denormalisation

Hi all,

I am solving a issue with pig integration with cassandra using CqlLoader. I
don't know exactly if the problem is in CqlLoader, my low understanding of
Pig (I hope this is actually the case) or some bug in the combination of
Pig and CqlLoader. Sorry if this turns out to be rather a Pig question and
not a Cassandra one.

I have a table using cql maps:

CREATE TABLE test (
  name text PRIMARY KEY,
  sources map<text, text>
)

I need to denormalise the map in order to perform some sanitary checks on
the rest of the DB (outer join using values from the map with another
tables in cassandra keyspace). I want to create triples containing table
key, map key and map value for further joining. The size of the map is
anything between null and tens of records. The table test itself is pretty
small.

This is what I do:

grunt> data = LOAD 'cql://keyspace/test' USING CqlStorage();
grunt> describe data;
data: {name: chararray,sources: ()}
grunt> data1 = filter data by sources is not null;
grunt> dump data1;
(name1,((k1,s1),(k2,s2)))
grunt> data2 = foreach data1 generate name, flatten(sources);
grunt> dump data2;
(name1,(k1,s1),(k2,s2))
grunt> describe data2;
Schema for data2 unknown.
grunt> data3 = FOREACH data2 generate $0 as name, FLATTEN(TOBAG($1..$100));
// I know there will be max tens of records in the map
grunt> dump data3;
(name1,k1,s1)
(name1,k2,s2)
(name1,)
(name1,)
... 95 more lines here ...
grunt> data4 = FILTER data3 BY $1 IS NOT null;
grunt> dump data4;
(name1,k1,s1)
(name1,k2,s2)
grunt> describe data4;
data4: {name: bytearray,bytearray}
grunt> data5 = foreach data4 generate $0, $1;
grunt> dump data5;
(name1,k1)
(name1,k2)
grunt> p = foreach data4 generate $0, $2;
Details at logfile: /..../pig_xxx.log
>From the log file:
Pig Stack Trace
---------------
ERROR 1000:
<line 28, column 33> Out of bound access. Trying to access non-existent
column: 2. Schema name:bytearray,:bytearray has 2 column(s).

org.apache.pig.impl.plan.PlanValidationException: ERROR 1000:
<line 28, column 33> Out of bound access. Trying to access non-existent
column: 2. Schema name:bytearray,:bytearray has 2 column(s).
at
org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:197)
at
org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)

Considering the schema - no surprise. What is strange is the fact I see the
map values in dump (see dump data4), but I have no way to get them using
pig latin.

I tried to simulate the situation using PigStorage loader. This is the best
I got (not exactly the same, but roughly):

grunt> data = load 'test.csv' using PigStorage(',');
grunt> dump data;
(key1,mk1,mv1,mk2,mv2)
(key2)
(key3,mk1,mv3,mk2,mv4)
grunt> data1 = foreach data generate $0, TOTUPLE($1, $2), TOTUPLE($3, $4);
grunt> dump data1;
(key1,(mk1,mv1),(mk2,mv2))
(key2,(,),(,))
(key3,(mk1,mv3),(mk2,mv4))
grunt> data2 = FOREACH data1 generate $0 as name, FLATTEN(TOBAG($1..$2));
grunt> dump data2;
(key1,mk1,mv1)
(key1,mk2,mv2)
(key2,,)
(key2,,)
(key3,mk1,mv3)
(key3,mk2,mv4)
grunt> describe data2;
data2: {name: bytearray,bytearray,bytearray}

Which is exactly what I need. The only problem is this simulation doesn't
allow me to specify the arbitrary high value in the FLATTEN(TOBAG()) call -
I need to know in advance what is the size of the row.

Questions:

- is this the correct way to denormalize the data? This is a pig question,
but maybe someone will know (I am a pig newbie).
- couln't there be a problem with internal data representation returned
from CqlStorage? See the difference between data loaded from file and these
loaded from cassandra.

Versions: cassandra 1.2.11, Pig 0.12.

Thanks in advance,

Ondrej Cernos