Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/05/20 22:49:03 UTC

CassandraStorage loader generating 2x as many records?

(I accidentally cross-posted this to the Cassandra list when I meant to post
it here.)

Either this is a bug or I'm insane.

Here's my table in Cassandra:

CREATE TABLE test_source (
  id int,
  primary key(id)
);

INSERT INTO test_source (ID) VALUES(1);
INSERT INTO test_source (ID) VALUES(2);
INSERT INTO test_source (ID) VALUES(3);
INSERT INTO test_source (ID) VALUES(4);

cqlsh:blogindex> select * from test_source;

 id
----
  1
  2
  4
  3

(4 rows)

Now I load that into Pig and run:

test_source = LOAD 'cassandra://blogindex/test_source' USING
CassandraStorage() AS (source, target: bag {T: tuple(name, value)});

dump test_source;

(4,{((),)})
(1,{((),)})
(2,{((),)})
(4,{((),)})
(1,{((),)})
(3,{((),)})
(3,{((),)})
(2,{((),)})

Now, it COULD be a bug in 'dump', but even then that's still a bug.

Each id shows up exactly twice (8 tuples for 4 rows), so I suspect that
Cassandra is getting confused and handing too many rows to Pig, possibly
because the input splits are being duplicated?
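
One way to check this without eyeballing the dump output is to group by the
key and count the tuples per key. This is only a sketch: it reuses the
test_source alias and the source field from the LOAD above, and the grouped
and counts aliases are names made up here for illustration.

```pig
-- Sketch only: 'grouped' and 'counts' are illustrative alias names.
-- Group the loaded tuples by row key, then count the tuples under each key.
grouped = GROUP test_source BY source;
counts = FOREACH grouped GENERATE group AS source, COUNT_STAR(test_source) AS n;

dump counts;
```

If the loader really is handing back each row twice, every key should come
out with n = 2 here instead of 1. COUNT_STAR is used rather than COUNT so
that tuples are still counted even if their first field is null.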

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.