Posted to commits@cassandra.apache.org by "Benoit Perroud (JIRA)" <ji...@apache.org> on 2011/09/05 22:23:16 UTC

[jira] [Created] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns
-------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-3141
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3141
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.8.3
            Reporter: Benoit Perroud
            Priority: Minor


Every time newRow is called, serializedSize iterates through all the columns to compute the size.

Once 1'000'000 columns exist in the CF, repeating the same computation at every iteration becomes painful. Caching the size and incrementing it when a Column is added could be an option.
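
A minimal sketch of the caching idea described above, for illustration only. The classes SketchColumn and CachedSizeColumns and the toy size model are assumptions made for this example; they are not Cassandra's Column or ColumnFamily, and this is not the attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a column; not Cassandra's Column class. */
final class SketchColumn {
    final String name;
    final byte[] value;

    SketchColumn(String name, byte[] value) {
        this.name = name;
        this.value = value;
    }

    /** Toy size model (assumption): name bytes + value bytes + 8 bytes for a timestamp. */
    int serializedSize() {
        return name.getBytes(StandardCharsets.UTF_8).length + value.length + 8;
    }
}

/** Hypothetical container that keeps a running total instead of re-scanning. */
final class CachedSizeColumns {
    private final List<SketchColumn> columns = new ArrayList<>();
    private long cachedSize = 0;

    void addColumn(SketchColumn c) {
        columns.add(c);
        cachedSize += c.serializedSize(); // constant-time bookkeeping per insert
    }

    /** Constant time: returns the running total. */
    long serializedSize() {
        return cachedSize;
    }

    /** Linear time: the full scan over every column, kept for comparison. */
    long recomputedSize() {
        long total = 0;
        for (SketchColumn c : columns) {
            total += c.serializedSize();
        }
        return total;
    }
}
```

In this sketch both methods return the same value as long as every added column is kept; the cached variant simply avoids walking the list on each call.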

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098993#comment-13098993 ] 

Sylvain Lebresne commented on CASSANDRA-3141:
---------------------------------------------

To be precise, this doesn't work correctly: if you add a column and there is already an existing column with the same name, this won't compute the serialized size correctly.

Now we could say that this doesn't matter much, in the sense that:
  # If you use SSTSUW in cases where you update the same column a lot, you're probably doing it wrong.
  # Even when that happens, the consequence is that you will 'flush to disk' more often than you would otherwise, which isn't necessarily a big deal.
  # It is an estimation anyway.

That being said, I wonder if this call to serializedSize() is really that costly. Maybe it adds a little bit of cost when you 'reopen' a row multiple times, but you are not supposed to do that too much really (if ever).
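
The overcounting concern in this comment can be illustrated with a small sketch. Everything here is an assumption made for the example: OverwriteSizeDemo is a hypothetical class, the size model is a toy, and a later add simply replaces an earlier column of the same name (real column reconciliation is more involved). The adjustedSize counter shows one possible correction; it is not something proposed in the thread.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a row whose columns are keyed by name; not Cassandra code. */
final class OverwriteSizeDemo {
    private final Map<String, byte[]> columns = new TreeMap<>();
    private long naiveSize = 0;    // incremented on every add, even on overwrite
    private long adjustedSize = 0; // also subtracts the value being replaced

    /** Toy size model (assumption): name length in chars plus value length in bytes. */
    private static long sizeOf(String name, byte[] value) {
        return name.length() + value.length;
    }

    void addColumn(String name, byte[] value) {
        byte[] previous = columns.put(name, value); // returns the replaced value, or null
        naiveSize += sizeOf(name, value);
        adjustedSize += sizeOf(name, value);
        if (previous != null) {
            adjustedSize -= sizeOf(name, previous);
        }
    }

    long naiveSize() {
        return naiveSize;
    }

    long adjustedSize() {
        return adjustedSize;
    }

    /** Full scan over the columns actually held. */
    long actualSize() {
        long total = 0;
        for (Map.Entry<String, byte[]> e : columns.entrySet()) {
            total += sizeOf(e.getKey(), e.getValue());
        }
        return total;
    }
}
```

With this model, the naive counter can only overestimate, which matches the point that the practical consequence is flushing earlier than strictly necessary.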


[jira] [Resolved] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

Posted by "Sylvain Lebresne (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne resolved CASSANDRA-3141.
-----------------------------------------

       Resolution: Not A Problem
    Fix Version/s:     (was: 0.8.8)

OK, closing this for now. If someone has evidence that there is a real need for optimization here, they can reopen it.
                

[jira] [Commented] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

Posted by "Benoit Perroud (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100449#comment-13100449 ] 

Benoit Perroud commented on CASSANDRA-3141:
-------------------------------------------

Iterating through a list of 1'000'000 elements obviously takes time.

But I agree with both of you:
- it's a premature optimization; I will try with CASSANDRA-2843 first
- the way I use SSTSUW is not completely appropriate; I get much better results writing one key after the other.
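
As a rough illustration of "one key after the other", the sketch below reorders interleaved (key, column) entries so that all columns of a key are adjacent, and counts how often a row would have to be opened. GroupByKeyDemo and the String[] {key, column} entry layout are hypothetical and chosen for this example only; no SSTSUW API is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; entries are String[] {key, columnName}. */
final class GroupByKeyDemo {
    /** Counts how often the key changes between consecutive entries. */
    static int rowOpens(List<String[]> entries) {
        int opens = 0;
        String current = null;
        for (String[] e : entries) {
            if (!e[0].equals(current)) {
                opens++;
                current = e[0];
            }
        }
        return opens;
    }

    /** Makes all columns of a key adjacent, keeping first-seen key order. */
    static List<String[]> groupByKey(List<String[]> entries) {
        Map<String, List<String[]>> byKey = new LinkedHashMap<>();
        for (String[] e : entries) {
            byKey.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e);
        }
        List<String[]> grouped = new ArrayList<>();
        for (List<String[]> group : byKey.values()) {
            grouped.addAll(group);
        }
        return grouped;
    }
}
```

For input that alternates between two keys, grouping first means each row is opened once instead of once per entry, so any per-open size computation runs far less often.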



[jira] [Updated] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3141:
--------------------------------------

         Reviewer: slebresne
    Fix Version/s: 0.8.6


[jira] [Commented] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098998#comment-13098998 ] 

Jonathan Ellis commented on CASSANDRA-3141:
-------------------------------------------

It does feel a little like premature optimization to me, in the absence of profiler data showing this is a major expense.  When I've looked at similar profiling before, the actual serialization was much more of a bottleneck than serializedSize.


[jira] [Updated] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns

Posted by "Benoit Perroud (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Perroud updated CASSANDRA-3141:
--------------------------------------

    Attachment: CachedSizeCF.patch

PoC for a CF that computes the serialized size when a new Column is added, instead of computing it every time CF.serializedSize is called.
