Posted to commits@cassandra.apache.org by "Jonathan Shook (JIRA)" <ji...@apache.org> on 2015/08/21 18:15:48 UTC

[jira] [Commented] (CASSANDRA-9264) Cassandra should not persist files without checksums

    [ https://issues.apache.org/jira/browse/CASSANDRA-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706942#comment-14706942 ] 

Jonathan Shook commented on CASSANDRA-9264:
-------------------------------------------

This came up in discussion with a customer today. There is effectively a difference in read-response handling between data from compressed sstables and non-compressed sstables, because the per-block checksums on compressed sstables allow corrupted data to be disqualified. Non-compressed sstables have no equivalent checksum mechanism, so they are susceptible to passing hardware-level corruption up the stack without detection. Corrupted sectors may render an sstable unreadable, but they may also manifest as a silent, undetected change in the data returned.
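A minimal sketch of the mechanism described above (this is illustrative, not Cassandra's actual code): a CRC32 stored alongside each block lets a reader reject corrupted data instead of silently returning it, which is exactly what non-compressed sstables lack.

```java
import java.util.zip.CRC32;

// Sketch: a per-block checksum, as kept for compressed sstable chunks,
// turns silent corruption into a detectable verification failure.
public class BlockChecksumDemo {
    static long crc(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some sstable block contents".getBytes();
        long stored = crc(block);          // persisted next to the block on write

        block[3] ^= 0x01;                  // simulate a single-bit disk error
        boolean detected = crc(block) != stored;
        System.out.println("corruption detected: " + detected);
    }
}
```

CRC32 detects any single-bit error by construction, so the read path can fail loudly here; without the stored checksum the flipped bit would be passed up as valid data.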





> Cassandra should not persist files without checksums
> ----------------------------------------------------
>
>                 Key: CASSANDRA-9264
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9264
>             Project: Cassandra
>          Issue Type: Wish
>            Reporter: Ariel Weisberg
>             Fix For: 3.x
>
>
> Even if checksums aren't validated on every read, it is helpful to persist files with checksums so that, when a corrupted file is encountered, you can at least determine whether the issue is corruption or an application-level error that generated a corrupt file.
> We should standardize on conventions for how to checksum a file and which checksums to use so we can ensure we get the best performance possible.
> For a small checksum I think we should use CRC32 because the hardware support appears quite good.
> For cases where a 4-byte checksum is not enough I think we can look at either xxhash64 or MurmurHash3.
> The problem with xxhash64 is that its output is only 8 bytes. The problem with MurmurHash3 is that the Java implementation is slow. If we can live with 8 bytes and make it easy to switch hash implementations, I think xxhash64 is a good choice because we already ship a good implementation with LZ4.
> I would also like to see hashes always prefixed by a type so that we can swap hash implementations without pain figuring out which one is present. I would also like to avoid making assumptions about the number of bytes in a hash field where possible, keeping compatibility and space issues in mind.
> Hashing after compression is also preferable to hashing before compression.
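The type-prefix idea above can be sketched roughly as follows. All names and tag values here are hypothetical, invented for illustration; an explicit length byte also avoids baking a fixed digest width into the format, per the comment above.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical framing (not Cassandra's actual on-disk format):
// [1-byte algorithm tag][1-byte digest length][digest bytes]
// The tag lets readers dispatch to the right hash; the length byte
// avoids assuming a fixed digest width across algorithms.
public class TaggedChecksum {
    static final byte TYPE_CRC32 = 1;   // assumed tag value, illustrative only

    static byte[] digest(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
    }

    static byte[] encode(byte[] data) {
        byte[] d = digest(data);
        ByteBuffer out = ByteBuffer.allocate(2 + d.length);
        out.put(TYPE_CRC32);
        out.put((byte) d.length);       // explicit length: no fixed-width assumption
        out.put(d);
        return out.array();
    }

    static boolean verify(byte[] data, byte[] tagged) {
        ByteBuffer in = ByteBuffer.wrap(tagged);
        byte type = in.get();
        if (type != TYPE_CRC32)
            throw new IllegalArgumentException("unknown checksum type: " + type);
        byte[] stored = new byte[in.get()];
        in.get(stored);
        return Arrays.equals(stored, digest(data));
    }
}
```

A reader seeing an unknown tag fails fast with a clear error, rather than misinterpreting an xxhash64 digest as a CRC32; swapping in a new hash only requires a new tag value.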
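One practical reason for the compress-then-hash ordering: the reader can verify a block over the still-compressed bytes without paying the cost of decompressing it first. A small sketch under that assumption (class and method names are illustrative, using JDK Deflater rather than Cassandra's compressors):

```java
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Sketch: compress first, then checksum the compressed bytes, so
// verification on read needs no decompression.
public class CompressThenChecksum {
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64]; // slack for incompressible input
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    static long checksum(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] compressed = compress("row data to persist".getBytes());
        long stored = checksum(compressed);   // written alongside the block
        // On read: recompute over the still-compressed bytes.
        System.out.println("verified: " + (checksum(compressed) == stored));
    }
}
```

Had the hash been taken over the uncompressed bytes instead, every integrity check would first have to run the decompressor, even for blocks that are never otherwise read.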



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)