Posted to user@cassandra.apache.org by Scott Fines <Sc...@nisc.coop> on 2011/09/30 17:29:35 UTC

sstable2json weirdness

Hi all,

I've been messing with sstable2json as a means of mass-exporting some data (mainly for backups, but also for some convenience trickery on an individual node's data). However, I've run into a situation where sstable2json appears to be dumping out TONS of duplicate columns for a single row.

For example, for a single key, I did

$CASSANDRA_HOME/bin/sstable2json <sstable> -k <key> > output.file

which ran until I killed it manually. Then I executed
cat output.file | sed 's/]/\n/g'  | wc -l

which gave me 40 million and some change. On the other hand,

cat output.file | sed 's/]/\n/g' | sort -n | uniq | wc -l

gave me around 10K (much closer to reality).
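For the counting step, a single awk pass can split and deduplicate without sorting the full 40-million-line stream. A sketch against a tiny made-up sample (the output.json name and its contents are illustrative, not real sstable2json output):

```shell
# Hypothetical stand-in for sstable2json output: three ']'-delimited
# column entries, one of them a duplicate.
printf '["a",1]["b",2]["a",1]' > output.json

# Split on ']' and count distinct entries in one pass; awk's seen[]
# hash replaces the sort|uniq step, so nothing gets sorted.
tr ']' '\n' < output.json | awk 'NF && !seen[$0]++' | wc -l   # prints 2
```

On a 40M-line stream this avoids sort's temporary files, at the cost of holding the distinct entries in memory.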

For my particular data set, the total size of any given row cannot exceed 80K columns. So I'm wondering: Is this normal behavior for sstable2json? Assuming that it is, is there any way in which I can massage sstable2json into not emitting duplicates? These duplicates eat a great deal of disk space and processing power to manipulate, which I'd like to avoid.
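Short of a flag in sstable2json itself, the duplicates can be stripped in post-processing while keeping valid JSON. A sketch assuming the {"rowkey": [[name, value, timestamp], ...]} layout; the row.json sample, its hex key, and the deduped.json name are all made up for illustration:

```shell
# Hypothetical sample row: a key mapping to a column list with one
# duplicated [name, value, timestamp] triple.
printf '{"726f77": [["a","1",10], ["b","2",20], ["a","1",10]]}' > row.json

# Drop duplicate columns per row, preserving first-seen order and
# emitting valid JSON again.
python3 -c '
import json
row = json.load(open("row.json"))
for key, cols in row.items():
    seen, deduped = set(), []
    for col in cols:
        t = tuple(col)
        if t not in seen:
            seen.add(t)
            deduped.append(col)
    row[key] = deduped
print(json.dumps(row))
' > deduped.json
cat deduped.json
```

Unlike sort | uniq on the flattened text, this keeps the JSON structure intact, so the result can still be fed to json2sstable.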


Thanks for your help,

Scott



Re: sstable2json weirdness

Posted by Jonathan Ellis <jb...@gmail.com>.
It's possible we have a paging bug in sstable2json.

On Fri, Sep 30, 2011 at 10:29 AM, Scott Fines <Sc...@nisc.coop> wrote:
> Hi all,
> I've been messing with sstable2json as a means of mass-exporting some data
> (mainly for backups, but also for some convenience trickery on an individual
> node's data). However, I've run into a situation where sstable2json appears
> to be dumping out TONS of duplicate columns for a single row.
> For example, for a single key, I did
> $CASSANDRA_HOME/bin/sstable2json <sstable> -k <key> > output.file
> which ran until I killed it manually. Then I executed
> cat output.file | sed 's/]/\n/g'  | wc -l
> which gave me 40 million and some change. On the other hand,
> cat output.file | sed 's/]/\n/g' | sort -n | uniq | wc -l
> gave me around 10K (much closer to reality).
> For my particular data set, the total size of any given row cannot exceed
> 80K columns. So I'm wondering: Is this normal behavior for sstable2json?
> Assuming that it is, is there any way in which I can massage sstable2json
> into not emitting duplicates? These duplicates eat a great deal of disk
> space and processing power to manipulate, which I'd like to avoid.
>
> Thanks for your help,
> Scott
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com