Posted to user@cassandra.apache.org by marlon hendred <mh...@gmail.com> on 2014/04/23 22:39:05 UTC

Running Hadoop jobs over compressed column families with DataStax

Hi,

I'm attempting to dump a Pig relation over a compressed column family. It's a
single column whose value is a JSON blob; the CF is compressed with Snappy
and the value validator is BytesType. After I create the relation and dump
it, I get garbage. Here is the describe output:

ColumnFamily: CF
      Key Validation Class: org.apache.cassandra.db.marshal.TimeUUIDType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Cells sorted by: org.apache.cassandra.db.marshal.UTF8Type
      GC grace seconds: 86400
      Compaction min/max thresholds: 2/32
      Read repair chance: 0.1
      DC Local Read repair chance: 0.0
      Populate IO Cache on flush: false
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Pig stuff:
rows = LOAD 'cql://Keyspace/CF' using CqlStorage();

I've tried to override the schema by adding 'as (key: chararray, col1:
chararray, value: chararray)', but when I dump the relation the value still
looks like binary.

Do I need to implement my own CqlStorage() here that uncompresses the values,
or am I just missing something? I've done some googling but haven't seen
anything on the subject. Also, I'm using DataStax Enterprise 3.1. Thanks in
advance!

-m

Re: Running Hadoop jobs over compressed column families with DataStax

Posted by marlon hendred <mh...@gmail.com>.
I was able to solve the issue. There was another layer of compression
happening in the DAO, which ran the values through
java.util.zip.Deflater/Inflater on top of the Snappy compression defined on
the CF. (The Snappy sstable compression is applied transparently by Cassandra
at the storage layer, so it was never the problem; the application-level
Deflater layer was.) The solution was to extend CassandraStorage and override
the getNext() method. The new implementation calls super.getNext() and
inflates the tuple values where appropriate.
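
In case it helps anyone later, here is a rough sketch of the idea. It's a
reconstruction, not my exact code: the class and package names are made up,
and this version naively walks every field and inflates anything that parses
as Deflater data, whereas you'd probably want to target just the one value
column in your own schema.

package com.example.pig; // hypothetical package

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

import org.apache.cassandra.hadoop.pig.CassandraStorage;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

// Sketch: a CassandraStorage subclass that undoes the application-level
// Deflater compression after reading each tuple. The Snappy sstable
// compression needs no handling here; Cassandra decompresses it for us.
public class InflatingCassandraStorage extends CassandraStorage {

    @Override
    public Tuple getNext() throws IOException {
        Tuple tuple = super.getNext();
        if (tuple == null)
            return null; // end of input
        inflateFields(tuple);
        return tuple;
    }

    // CassandraStorage may nest (name, value) pairs inside inner tuples,
    // so walk recursively and inflate DataByteArray fields in place.
    private void inflateFields(Tuple tuple) throws IOException {
        for (int i = 0; i < tuple.size(); i++) {
            Object field = tuple.get(i);
            if (field instanceof Tuple) {
                inflateFields((Tuple) field);
            } else if (field instanceof DataByteArray) {
                byte[] raw = ((DataByteArray) field).get();
                tuple.set(i, new DataByteArray(inflate(raw)));
            }
        }
    }

    // Inflates a java.util.zip.Deflater-compressed byte array; bytes that
    // don't parse as zlib data (e.g. keys, column names) pass through.
    private static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length * 4);
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput())
                    break; // truncated stream; keep what we have
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            return data; // not Deflater output; leave it alone
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}

You'd then REGISTER the jar in the Pig script and load with USING
com.example.pig.InflatingCassandraStorage(). Note that CassandraStorage takes
cassandra://Keyspace/CF URIs rather than the cql:// URI that CqlStorage uses.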

-Marlon

