Posted to user@cassandra.apache.org by Elad Efrat <el...@innu.org> on 2015/07/07 16:29:14 UTC

Streaming data to Cassandra with Hadoop

Hello,

I'm loading data from HDFS into Cassandra using Spotify's hdfs2cass. The
setup is a 4-node cluster running Cassandra 2.1.6 with RF=2 and STCS. The
raw data size is about 1 TB before loading and 3.8 TB after. The process
works fine, but I do have a few questions.
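
For reference, here is a rough breakdown of those sizes. This is only a
sketch: it assumes the 3.8 TB figure is the cluster-wide on-disk total
including both replicas, and that data is spread evenly across the nodes.
The function name is made up for illustration.

```python
# Rough size arithmetic for the cluster described above.
# ASSUMPTIONS: 3.8 TB is the total across all nodes (replicas included),
# and data is evenly distributed across the 4 nodes.
RAW_TB = 1.0      # raw data size in HDFS before loading
LOADED_TB = 3.8   # on-disk size in Cassandra after loading
NODES = 4
RF = 2            # replication factor


def size_breakdown(raw_tb, loaded_tb, nodes, rf):
    """Return (TB per node, TB per single replica copy, expansion factor)."""
    per_node = loaded_tb / nodes       # even distribution assumed
    per_replica = loaded_tb / rf       # size of one full copy of the data
    expansion = per_replica / raw_tb   # growth of one copy over the raw input
    return per_node, per_replica, expansion


per_node, per_replica, expansion = size_breakdown(RAW_TB, LOADED_TB, NODES, RF)
print("per node: %.2f TB, per replica: %.2f TB, expansion: %.2fx"
      % (per_node, per_replica, expansion))
```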

1. Some Hadoop jobs fail due to streaming timeouts. That's fine,
because subsequent attempts succeed, but why do I get the timeouts in
the first place? Would this be something network-related or does
Cassandra have a limit on how much streaming it can handle?

2. The server logs show errors like the one quoted below, for
"malformed input around byte N" --

    ERROR [STREAM-IN-/10.84.30.209] 2015-07-06 11:30:10,915 StreamSession.java:499 - [Stream #e1e4f470-23fb-11e5-9c95-9b249a189cad] Streaming error occurred
    java.io.UTFDataFormatException: malformed input around byte 10
    at java.io.DataInputStream.readUTF(DataInputStream.java:656) ~[na:1.7.0_67]
    at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.7.0_67]
    at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:143) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:120) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:42) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

Is this a familiar issue? I'd expect the data to be the same across
all streaming attempts. I can theorize about the timeouts, but does
anyone know what might be causing these errors? Are they normal?

3. About compaction. There's a RESTful service in front of Cassandra
and I see the average response time is positively correlated with the
number of compactions pending (it drops as they drop). Is there a way
to stream such that the number of compactions once the streaming is
done is minimal?

4. Also about compaction: I understand that while STCS is
write-optimized and reduces the number of SSTables, LCS is
read-optimized and might increase it. The aforementioned service needs
read-only access to Cassandra. Loading with LCS resulted in an order
of magnitude more compactions and dramatically higher server load.
Given I want minimal response time ASAP, what approach should I be
taking? Right now I load with STCS, wait for compactions to finish,
and am considering a switch to LCS once they're done. Does that make
sense? Any thoughts on improving this process? (Ideally, is there
anything close to a one-shot process where compaction is barely
required?)
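
To make the STCS-then-LCS switch concrete, this is roughly the statement
I have in mind, built here with a small Python helper. It is only a
sketch: the helper name and the keyspace/table names (my_keyspace,
my_table) are placeholders, not the real schema, and the resulting CQL
would still have to be run through cqlsh or a driver.

```python
def alter_compaction_cql(keyspace, table, strategy, **options):
    """Build a CQL ALTER TABLE statement that changes the compaction strategy.

    Hypothetical helper for illustration only; it just formats a string.
    """
    opts = {"class": strategy}
    opts.update(options)
    # Render the options as a CQL map literal with quoted keys and values.
    body = ", ".join("'%s': '%s'" % (k, v) for k, v in opts.items())
    return "ALTER TABLE %s.%s WITH compaction = {%s};" % (keyspace, table, body)


# Placeholder names; substitute the real keyspace and table.
stmt = alter_compaction_cql("my_keyspace", "my_table",
                            "LeveledCompactionStrategy")
print(stmt)
```

Running the generated statement after the STCS compactions have settled
would trigger a re-compaction of the existing SSTables into levels, so
the question above about overall cost still applies.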

I'll gladly provide additional information if needed. I'll also be
happy to hear about others' experience in similar scenarios.

Thanks,

Elad