Posted to user@cassandra.apache.org by marlon hendred <mh...@gmail.com> on 2014/04/23 22:39:05 UTC
Running Hadoop jobs over compressed column families with DataStax
Hi,
I'm attempting to dump a Pig relation of a compressed column family. It's a
single column whose value is a JSON blob, compressed with Snappy, and the
value validator is BytesType. After I create the relation and dump it, I get
garbage. Here is the describe output:
ColumnFamily: CF
Key Validation Class: org.apache.cassandra.db.marshal.TimeUUIDType
Default column value validator:
org.apache.cassandra.db.marshal.BytesType
Cells sorted by: org.apache.cassandra.db.marshal.UTF8Type
GC grace seconds: 86400
Compaction min/max thresholds: 2/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy:
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression:
org.apache.cassandra.io.compress.SnappyCompressor
Pig stuff:
rows = LOAD 'cql://Keyspace/CF' using CqlStorage();
I've tried to override the schema by adding 'as (key: chararray, col1:
chararray, value: chararray)', but when I dump the relation the values still
look like binary.
Do I need to implement my own CqlStorage() here that decompresses the
values, or am I just missing something? I've done some Googling but haven't
found anything on the subject. Also, I am using DataStax Enterprise 3.1.
Thanks in advance!
-m
Re: Running Hadoop jobs over compressed column families with DataStax
Posted by marlon hendred <mh...@gmail.com>.
I was able to solve the issue. There was another layer of compression
happening in the DAO, which used java.util.zip.Deflater/Inflater on top of
the Snappy compression defined on the CF. The solution was to extend
CassandraStorage and override the getNext() method. The new implementation
calls super.getNext() and inflates the tuple fields where appropriate.
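[A minimal, self-contained sketch of the inflation step described above. The class and method names here are hypothetical, not from the actual fix; the real subclass also extends CassandraStorage (Pig and Cassandra jars on the classpath) and applies inflate() to each deflated field returned by super.getNext().]

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for the extra zlib layer the DAO added on top of
// Cassandra's SSTable-level Snappy compression. An overridden getNext()
// would run each compressed field's bytes through inflate() before
// re-wrapping them in the returned Tuple.
public class ZlibCodec {

    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out =
                new ByteArrayOutputStream(Math.max(32, compressed.length * 3));
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) break; // truncated input
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new RuntimeException("payload is not zlib-deflated", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Deflate counterpart, matching what a java.util.zip.Deflater-based
    // DAO would have written; included here for round-trip checking.
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

[Note that this only undoes the application-level Deflater layer; the SSTable-level Snappy compression is transparent to clients and never reaches Pig.]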
-Marlon