Posted to user@cassandra.apache.org by aaron morton <aa...@thelastpickle.com> on 2011/08/01 23:55:04 UTC

Re: Cassandra bulk import confusion

In case you missed it, fresh off the press: http://www.datastax.com/dev/blog/bulk-loading

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 30 Jul 2011, at 04:10, Jeff Schmidt wrote:

> Hello:
> 
> I'm relatively new to Cassandra, but I've been searching around, and it looks like Cassandra 0.8.x has improved support for bulk importing of data.  I keep finding references to the json2sstable command, and I've read about that on the Datastax and Apache documentation pages.
> 
> There's a lot of detail here if you want it, otherwise please skip to the end. json2sstable seems to run successfully, but I cannot see the data in the new CF using the CLI.
> 
> My goal is to extract data from various sources, munge it together in some manner, and then bulk load it into Cassandra.  That is as opposed to using Hector to programmatically insert the data.  I'd like to deploy these files to the cloud (Puppet) and then instruct Cassandra to bulk load them, and then inform the application that new data exists.  This is for a periodic content update of certain column families of curated, read-only data that occurs on a monthly basis. I'm thinking of using JMX to signal the application to switch to a new set of CFs and keep running w/o downtime.  At a later time, I'll delete the old CFs.
> 
> I'm using Cassandra 0.8.2 and I'm just playing with this concept.  I create a test CF using the CLI:
> 
> [default@Ingenuity] use Test;
> Authenticated to keyspace: Test
> [default@Test] create column family TestCF with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
> 28991070-b9f9-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] update column family TestCF with key_validation_class=UTF8Type; 
> 2af88440-b9f9-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';  
> Value inserted.
> [default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';  
> Value inserted.
> [default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003'; 
> Value inserted.
> [default@Test] list TestCF;
> Using default limit of 100
> -------------------
> RowKey: EG|3030
> => (column=nodeId, value=ING:002, timestamp=1311954072252000)
> -------------------
> RowKey: EG|3031
> => (column=nodeId, value=ING:003, timestamp=1311954073631000)
> -------------------
> RowKey: SID|123
> => (column=nodeId, value=ING:001, timestamp=1311954072249000)
> 
> 3 Rows Returned.
> [default@Test] 
> 
> Now, cassandra.yaml is stock, except I changed it to place the data in a non-default location:
> 
> # directories where Cassandra should store data on disk.
> data_file_directories:
>     - /usr/local/ingenuity/isec/cassandra/datastore/data
> 
> # commit log
> commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog
> 
> # saved caches
> saved_caches_directory: /usr/local/ingenuity/isec/cassandra/datastore/saved_caches
> 
> In that data directory:
> 
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> [imac:datastore/data/Test] jas% 
> 
> There is nothing there.  Perhaps Cassandra has not yet felt the need to write the SSTables.  So, since I need to reference an actual data file with sstable2json, I ran nodetool flush:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Now, I have files!
> 
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> TestCF-g-1-Data.db		TestCF-g-1-Index.db
> TestCF-g-1-Filter.db		TestCF-g-1-Statistics.db
> [imac:datastore/data/Test] jas% 
> 
> Given that, I'm able to run sstable2json and I can see I'm getting what's in that CF:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas%  bin/sstable2json /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > testcf.jason
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason 
> {
> "45477c33303330": [["nodeId","ING:002",1311954072252000]],
> "45477c33303331": [["nodeId","ING:003",1311954073631000]],
> "5349447c313233": [["nodeId","ING:001",1311954072249000]]
> }
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Oops, okay, that file extension should be json not jason, but oh well... :)
> 
> Okay, so now I have data in the proper format for importing with json2sstable.  Like I said, I want to import this data into a new CF. Let's call it TestCF2 (in the same keyspace):
> 
> [default@Test] create column family TestCF2 with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
> 4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] update column family TestCF2 with key_validation_class=UTF8Type; 
> 5092dec0-b9fa-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] 
> 
> Again there are no files created in the data directory, so I do a flush for the new CF:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF2
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Well, that did not help, still no files for TestCF2.  There is no actual data yet, so I'm guessing the system tables have what they need. So, I go ahead and import the data using json2sstable:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c TestCF2 testcf.jason /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
> Importing 3 keys...
> 3 keys imported successfully.
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Okay, and the files did show up:
> 
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> TestCF-g-1-Data.db		TestCF2-g-1-Data.db
> TestCF-g-1-Filter.db		TestCF2-g-1-Filter.db
> TestCF-g-1-Index.db		TestCF2-g-1-Index.db
> TestCF-g-1-Statistics.db	TestCF2-g-1-Statistics.db
> [imac:datastore/data/Test] jas% 
> 
> Back in the CLI:
> 
> [default@Test] list TestCF2;
> Using default limit of 100
> 
> 0 Row Returned.
> [default@Test] 
> 
> However, if I edit TestCF-g-1-Data.db, I can sort of see the data is present.  Quitting and restarting the CLI has no effect. What gets the CF data into the MemTables so it's accessible to a Cassandra client?   I tried various nodetool commands (repair, compact, cleanup, flush, invalidatekeycache, invalidaterowcache) and I still don't see any rows for TestCF2 in the CLI.
> 
> Anyway, it seems this procedure works as I'd expect, well except for not seeing the new data. :)
> 
> What am I missing here?
> 
> Thanks,
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> jas@535consulting.com
> http://www.535consulting.com
> (650) 423-1068
> 


Re: Cassandra bulk import confusion

Posted by Viksit Gaur <vi...@bloomreach.com>.
> On 30 Jul 2011, at 04:10, Jeff Schmidt wrote:
> Hello:
> I'm relatively new to Cassandra, but I've been searching around, and it looks like Cassandra 0.8.x has improved support for bulk importing of data.  I keep finding references to the json2sstable command, and I've read about that on the Datastax and Apache documentation pages.
> 
> There's a lot of detail here if you want it, otherwise please skip to the end. json2sstable seems to run successfully, but I cannot see the data in the new CF using the CLI.
> 

This is a late reply, but the main way to resolve this is:

- Run nodetool refresh <keyspace> <cfname>
- Ensure that the data files are named correctly according to the keyspace-cfname convention
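
For the naming check, here is a rough Python sketch that verifies a file name before running the refresh. The pattern is only inferred from the listing quoted earlier in this thread (e.g. TestCF2-g-1-Data.db, read here as column family, format version, generation number, component). The exact naming scheme differs between Cassandra releases, so treat the regex and the helper names (parse_sstable_name, belongs_to) as assumptions for illustration, not as anything Cassandra ships.

```python
import re

# Pattern inferred from the listing in this thread ("TestCF2-g-1-Data.db").
# ASSUMPTION: <cfname>-<version>-<generation>-<Component>.db; other Cassandra
# releases use different schemes, so adjust before relying on it.
SSTABLE_NAME = re.compile(
    r"^(?P<cf>\w+)-(?P<version>[a-z]+)-(?P<generation>\d+)-(?P<component>\w+)\.db$"
)


def parse_sstable_name(filename):
    """Return the parts of an SSTable file name, or None if it does not match."""
    match = SSTABLE_NAME.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["generation"] = int(parts["generation"])
    return parts


def belongs_to(filename, cf_name):
    """True if the file name parses and names the given column family."""
    parts = parse_sstable_name(filename)
    return parts is not None and parts["cf"] == cf_name


if __name__ == "__main__":
    for name in ["TestCF2-g-1-Data.db", "TestCF-g-1-Index.db", "testcf.jason"]:
        print(name, "->", parse_sstable_name(name), belongs_to(name, "TestCF2"))
```

If every component file for the new CF passes a check like this and sits in the keyspace's data directory, the refresh command above should be the step that makes the node pick the files up without a restart.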

- Viksit
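
A side note on the sstable2json output quoted earlier in the thread: the row keys in that JSON are the hex encoding of the key bytes, which is why "EG|3030" shows up as "45477c33303330". When generating input for json2sstable from your own sources, the keys presumably need the same encoding. A small Python sketch, assuming UTF-8 keys as in the TestCF example (the helper names are made up for illustration):

```python
def encode_row_key(key):
    """Hex-encode a text row key the way it appears in the sstable2json dump."""
    return key.encode("utf-8").hex()


def decode_row_key(hex_key):
    """Turn a hex row key from the JSON dump back into readable text."""
    return bytes.fromhex(hex_key).decode("utf-8")


if __name__ == "__main__":
    # Keys taken from the dump quoted in this thread.
    for hex_key in ["45477c33303330", "45477c33303331", "5349447c313233"]:
        print(hex_key, "->", decode_row_key(hex_key))
```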