Posted to user@cassandra.apache.org by Keith Bogs <ke...@gmail.com> on 2013/09/20 06:13:09 UTC

BigTable-like Versioned Cells, Importing PostgreSQL Data

I've been playing with Cassandra and have a few questions that I've been
stuck on for a while; Googling around didn't seem to help much:

1. What's the quickest way to import a bunch of data from PostgreSQL? I
have ~20M rows, mostly text (some long text with newlines, plus blob
files). I tried exporting to CSV but had issues with newlines and escape
characters. I also tried writing an ETL tool in Go, but it was taking a
long time to work through the records. (A sketch of the kind of exporter
I have in mind is at the end of this message.)

2. How would I create a "versioned" schema with CQL? AFAIK Cassandra's cell
versions are only for conflict resolution.

I envision a wide row, with timestamps and keys representing fields of data
through time. For example, for a CF of web page contents (inspired by
Google's Bigtable paper):

Key          1379649588:body 1379649522:body 1379649123:title
a.com/1.html "<html>"                        "A"
a.com/2.html                 "<html>"        "B"
b.com/1.html "<html>"        "<html>"        "C"

But CQL doesn't seem to support this. (Yes, I've read
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows.)
It seems that, once upon a time, Thrift and supercolumns might have worked for this?

I'd want to efficiently iterate through the "history" of a particular row
(in other words, read all the columns for a row) or efficiently iterate
through all the latest values for the CF (not reading the entire row, just
a column slice). In the previous example, I'd want to return the latest
'body' entries with timestamps for every page ("row"/"key") in the database.

Some have talked of having two CFs, one for versioned data and one for
current values?

I've been struggling because most of the documentation revolves around
Java. I'm most comfortable with Ruby and (increasingly) Go.

I'd appreciate any insights; I'd really like to get Cassandra going for
real. It's been such a pleasure to set up compared to HBase and whatnot.
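
For concreteness on question 1, here's a minimal sketch of the kind of
exporter I have in mind (the github.com/lib/pq driver and the "pages"
table and columns are placeholders, not my real schema). Go's
encoding/csv quotes any field containing newlines, commas, or quotes per
RFC 4180, so the dump should round-trip cleanly, unlike my first attempt:

package main

import (
    "database/sql"
    "encoding/csv"
    "log"
    "os"

    _ "github.com/lib/pq" // PostgreSQL driver (any database/sql driver works)
)

func main() {
    // Placeholder connection string.
    db, err := sql.Open("postgres", "dbname=mydb sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // "pages" and its columns are made-up names; substitute your own.
    rows, err := db.Query("SELECT url, title, body FROM pages")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    w := csv.NewWriter(os.Stdout)
    for rows.Next() {
        var url, title, body string
        if err := rows.Scan(&url, &title, &body); err != nil {
            log.Fatal(err)
        }
        // encoding/csv quotes fields containing newlines, so
        // multi-line text survives the round trip intact.
        if err := w.Write([]string{url, title, body}); err != nil {
            log.Fatal(err)
        }
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
    w.Flush()
    if err := w.Error(); err != nil {
        log.Fatal(err)
    }
}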

Keith

Re: BigTable-like Versioned Cells, Importing PostgreSQL Data

Posted by Tristan Seligmann <mi...@mithrandi.net>.
I saw that nobody had responded to this, so I thought I'd take a shot.

On Fri, Sep 20, 2013 at 6:13 AM, Keith Bogs <ke...@gmail.com> wrote:

>
> Key          1379649588:body 1379649522:body 1379649123:title
> a.com/1.html "<html>"                        "A"
> a.com/2.html                 "<html>"        "B"
> b.com/1.html "<html>"        "<html>"        "C"
>
> But CQL doesn't seem to support this. (Yes, I've read
> http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows.)
> It seems that, once upon a time, Thrift and supercolumns might have worked for this?
>

I would envision a schema something like this:

CREATE TABLE fields (
    page TEXT,          -- partition key: one partition per page
    timestamp INT,      -- clustering column: version, e.g. epoch seconds
    field_name TEXT,    -- clustering column: 'body', 'title', ...
    field_value TEXT,
    PRIMARY KEY (page, timestamp, field_name)
);
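
With that layout, the full history of one page is a single-partition
slice, something like this (untested; the key is just from your example):

SELECT timestamp, field_name, field_value
FROM fields
WHERE page = 'a.com/1.html'
ORDER BY timestamp DESC;

What this table can't answer cheaply is "the latest value of one field
across every page"; see the second table sketched below for that.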


> I'd want to efficiently iterate through the "history" of a particular row
> (in other words, read all the columns for a row) or efficiently iterate
> through all the latest values for the CF (not reading the entire row, just
> a column slice). In the previous example, I'd want to return the latest
> 'body' entries with timestamps for every page ("row"/"key") in the database.
>
> Some have talked of having two CFs, one for versioned data and one for
> current values?
>

I think this might be advisable, as slicing a single column out of every
row would not be that efficient; then again, it might not matter if you're
trying to retrieve every row in the entire database.
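
A "current values" table might be as simple as this sketch (you would
write each update to both tables):

CREATE TABLE current_fields (
    page TEXT,
    field_name TEXT,
    field_value TEXT,
    updated_at INT,     -- version timestamp this value came from
    PRIMARY KEY (page, field_name)
);

Fetching the current values for every page is then a plain scan of this
table rather than a slice through each page's entire history.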
-- 
mithrandi, i Ainil en-Balandor, a faer Ambar

Re: BigTable-like Versioned Cells, Importing PostgreSQL Data

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Sep 19, 2013 at 9:13 PM, Keith Bogs <ke...@gmail.com> wrote:

> I've been playing with Cassandra and have a few questions that I've been
> stuck on for a while; Googling around didn't seem to help much:
>
> 1. What's the quickest way to import a bunch of data from PostgreSQL? I
> have ~20M rows, mostly text (some long text with newlines, plus blob
> files). I tried exporting to CSV but had issues with newlines and escape
> characters. I also tried writing an ETL tool in Go, but it was taking a
> long time to work through the records.
>

http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra
is a good starting point.
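
On the newline problem specifically: PostgreSQL's default text-format
COPY escapes newlines as \n, while CSV-format COPY quotes them instead,
and cqlsh's COPY FROM can load the result. Roughly (keyspace, table, and
column names are placeholders; see HELP COPY in your cqlsh):

-- in psql:
--   \copy (SELECT ...) TO 'out.csv' WITH (FORMAT csv)
-- then in cqlsh:
COPY mykeyspace.mytable (key, col1, col2) FROM 'out.csv';

For ~20M rows that may be slow; sstableloader is the usual heavier-duty
alternative.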

=Rob