You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Ross M (JIRA)" <ji...@apache.org> on 2010/02/28 08:17:05 UTC

[jira] Created: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
---------------------------------------------------------------------------------

                 Key: CASSANDRA-836
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: n/a - all
            Reporter: Ross M
            Priority: Minor


CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 

the following code:

    /** writes header at the beginning of the file, then seeks back to current position */
    void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
    {
        long currentPos = logWriter.getFilePointer();
        logWriter.seek(0);

        writeCommitLogHeader(bytes);

        logWriter.seek(currentPos);
    }

works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.

i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.

the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.

a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839491#action_12839491 ] 

Jonathan Ellis commented on CASSANDRA-836:
------------------------------------------

commitlog header doesn't need to be very efficient either on space or time; it's a tiny amount of data that is only modified when a memtable flushes.  See http://wiki.apache.org/cassandra/ArchitectureCommitLog

everything but the header (i.e., the first couple hundred bytes) in the commitlog is serialized row mutations, which do not involve either bitset or jdk serialization.

if you want to try to optimize CL serialization, you should focus on RowMutation, not commitlogheader.


> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839705#action_12839705 ] 

Jonathan Ellis commented on CASSANDRA-836:
------------------------------------------

That's only for bitsets where the size hasn't been explicitly requested (ours always is).

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Reopened: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Ross M (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ross M reopened CASSANDRA-836:
------------------------------


it's already a bug even if i "don't do that."

java's serialization of the BitSet makes no promise about the size of the serialization (and it can't since it's a variable-size object.) the code is relying on behavior that isn't promised, and therefore can break with updates to java, switches to other runtime, or even adding more flags to the set. the code is currently lukcy, probably because the bitset only has 5 values, if it was 9 the relied upon behavior may not be the case.

it also prevents you from being able to improve the serialization of BitSet. it's used in a lot of places in the logs and data files (which is why i was looking at it in the first place.)

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Ross M (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839495#action_12839495 ] 

Ross M commented on CASSANDRA-836:
----------------------------------

i wasn't looking to improve the header, obviously it's just appears once. i just saw a lot of java.whatever.* class names and started looking at them. BitSet was the first thing i ran across (it also appears in the LocationInfo, perhaps for boomfilters) and coding a "better" version for the particular case only took about 10 minutes and seemed like a reasonable way to get acquainted with the code. i made the changes, verified they were working, but cassandra blew up on 2nd startup when it tired to allocate more memory (i assume in to a byte buffer) than the system would allow. that lead me to track down the problem (this bug-report) my implementation of BitSetSeriliaizer was valid/correct, but cassandra blew up. CommitLogHeader is relying on behavior that isn't implicit. if that behavior is going to be relied upon then it should be documented in BitSetSerializer and anything else CommitLogHeader uses. that said it's not good behavior to depend on since the most optimal implementation of BSS may not be constant size.





> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839498#action_12839498 ] 

Jonathan Ellis commented on CASSANDRA-836:
------------------------------------------

also, your serializer uses an int (=32 bits) for each set index in the set rather than 1 bit per index.  in other words, if the set is more than 1/32 full, yours will do worse than the jdk's.

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-836.
--------------------------------------

    Resolution: Invalid

it's not a bug, because we never change the size of the bitset

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Ross M (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839501#action_12839501 ] 

Ross M commented on CASSANDRA-836:
----------------------------------

i'm not a moron. that was an example attached to the bug that quickly caused the problem to happen.

that said it actually is smaller than the java serialization of the BitSet {2, 4} of length 5. with ints 5, 2, 2, 4 stored (128 bits) vs the 
 "java.util.BitSet" portion of java serialization which takes up 128 bits then you have at least 64-bits for the actual bit values plus anything else it sticks in there. there apparently are other things b/c the default version results in ~592 bits of data. even with all 5 bits turned on it would take 224 with the int version.

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Ross M (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ross M updated CASSANDRA-836:
-----------------------------

    Attachment: BitSetSerializer.java
                BitSetSerializer.java

this version of BitSetSerializer illustrates the problem. if you start up a server with it, write ~100 items. ctrl-c the server and then start it back up it will likely blow up when parsing the commit log or at least fail crcs. 

if you turn logging up you should see the inital bitset header write with {4} and a latter header write with {2, 4} that new bit uses an extra int in this version of the serializer and thus overwrites data (since the header grows.)

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Ross M (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ross M updated CASSANDRA-836:
-----------------------------

    Attachment:     (was: BitSetSerializer.java)

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839479#action_12839479 ] 

Jonathan Ellis edited comment on CASSANDRA-836 at 2/28/10 7:20 PM:
-------------------------------------------------------------------

it's not a bug, because we never change the size of the bitset.

if you want to add an assertion to that effect, fine, but making the serialization handle a situation that would be a horrible bug, is bad design.

      was (Author: jbellis):
    it's not a bug, because we never change the size of the bitset
  
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Ross M (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839485#action_12839485 ] 

Ross M commented on CASSANDRA-836:
----------------------------------

changing the size of the bitset is only one of the ways things could break. it's relying on the java BitSet::serialize to stay the same size. if http://java.sun.com/j2se/1.4.2/docs/api/serialized-form.html#java.util.BitSet is set in stone and all jdk impls have to implement it the same way it might be ok. if not things could break.

beyond that i was looking at trying to improve the on-disk size of things and this blocks me from being able to improve the size of BitSet's with a custom BitSetSerializer. currently a large percentage of the data is repetitive serialization info (java classes etc.) java serialization is not very space efficient. just take a good size commit log with varied data and gzip it, i was seeing ~10x size reduction. if you reduce the amount of data being written to the disk each write will take less time... 

if a header is going to be re-written like this it would be better to have the CommitLogHeader write all of it's data out on it's own so that it can make sure to use a fixed size rather than relying on the underlying implementations.


> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-836.
--------------------------------------

    Resolution: Invalid

if header size changes you've corrupted the rest of the commitlog, so don't do that.

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-836) CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839702#action_12839702 ] 

Gary Dusbabek commented on CASSANDRA-836:
-----------------------------------------

I came across this while I've been working on CASSANDRA-44.  BitSet defines its own serialization methods (readObject, writeObject) that have an optimization that affects the overall size of the serialization.  Basically, if any words at the end of the BitSet are zero, they are lopped off of the serialization, meaning that the size of the serialization depends on which bits are flipped.

> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't change.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-836
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-836
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: n/a - all
>            Reporter: Ross M
>            Priority: Minor
>         Attachments: BitSetSerializer.java
>
>
> CommitLogSegment::seekAndWriteCommitLogHeader assumes header size doesn't grow. there are pieces of the header (BitSet) that are serialized with java serialization which makes no such promises. 
> the following code:
>     /** writes header at the beginning of the file, then seeks back to current position */
>     void seekAndWriteCommitLogHeader(byte[] bytes) throws IOException
>     {
>         long currentPos = logWriter.getFilePointer();
>         logWriter.seek(0);
>         writeCommitLogHeader(bytes);
>         logWriter.seek(currentPos);
>     }
> works fine as long as the header size doesn't change, but if it grows the new header will over write the beginning of the data segment. the bit-set being written in the header happens to serialize to the same size, but there is no guarantee of this.
> i found this when looking at optimizing the serialization of data to disk (thus improving write throughput/performance.) i removed the ObjectOutputStream serialization in BitSetSerializer and replaced it with a custom serialization that omits the generic java serialization/ObjectOutputStream stuff and just writes on the "true" bits. the custom serialization worked fine, but broke other parts of the code when the header bitset had new bits turned on, thus growing the header's size, data segment bytes were overwritten.
> the serialized version of a BitSet can grow in a similar manner, no pomises of size/consistency are made, but with current use it luckily doesn't seem to happen.
> a good fix is unclear. without forcing the header to be a fixed/constant size in some manner this problem could pop up at any point. it's generally not safe to rewrite headers like this without custom code that ensures the size doesn't change. one fix would be to manually write all of the header data out (rather than relying on java serialization and serialization code in other parts of cassandra not to change.) another might be to pad the size of the header so that the data inside can grow, but that seems fraught with (potential) problems. (i've played around with padding the header length, but that seems to cause other things to break, which i haven't been able to track down yet.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.