Posted to commits@cassandra.apache.org by "Benoit Perroud (JIRA)" <ji...@apache.org> on 2011/09/05 22:23:16 UTC
[jira] [Created] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call
to ColumnFamily.serializedSize iterate through the whole columns
SSTableSimpleUnsortedWriter call to ColumnFamily.serializedSize iterate through the whole columns
-------------------------------------------------------------------------------------------------
Key: CASSANDRA-3141
URL: https://issues.apache.org/jira/browse/CASSANDRA-3141
Project: Cassandra
Issue Type: Improvement
Components: Core
Affects Versions: 0.8.3
Reporter: Benoit Perroud
Priority: Minor
Every time newRow is called, serializedSize iterates through all the columns to compute the size.
Once 1'000'000 columns exist in the CF, it becomes painful to repeat the same computation at every iteration. Caching the size and incrementing it when a Column is added could be an option.
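The caching idea described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch and not Cassandra's actual ColumnFamily code: the class CachedSizeSketch, the nested Col type, and the size model (name bytes plus value bytes) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a column family; not Cassandra's ColumnFamily. */
public class CachedSizeSketch {
    /** Hypothetical column: only a name and a value. */
    static final class Col {
        final byte[] name;
        final byte[] value;

        Col(byte[] name, byte[] value) {
            this.name = name;
            this.value = value;
        }

        // Assumed size model: name bytes plus value bytes, nothing else.
        int size() {
            return name.length + value.length;
        }
    }

    private final List<Col> columns = new ArrayList<>();
    private long cachedSize = 0; // running total, updated on every add

    void addColumn(Col c) {
        columns.add(c);
        cachedSize += c.size(); // O(1) bookkeeping instead of a later O(n) scan
    }

    /** The behaviour the issue describes: walk every column on each call. */
    long serializedSizeByIteration() {
        long total = 0;
        for (Col c : columns) {
            total += c.size();
        }
        return total;
    }

    /** The proposed alternative: return the running total. */
    long serializedSizeCached() {
        return cachedSize;
    }

    public static void main(String[] args) {
        CachedSizeSketch cf = new CachedSizeSketch();
        for (int i = 0; i < 1000; i++) {
            byte[] name = ("c" + i).getBytes(StandardCharsets.UTF_8);
            cf.addColumn(new Col(name, new byte[8]));
        }
        // Both totals agree as long as every added column has a distinct name.
        System.out.println(cf.serializedSizeByIteration() == cf.serializedSizeCached());
    }
}
```

The sketch appends to a list and never replaces a column, so the two totals always match here; it does not cover a column being added twice under the same name.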
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3141) SSTableSimpleUnsortedWriter
call to ColumnFamily.serializedSize iterate through the whole columns
Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098993#comment-13098993 ]
Sylvain Lebresne commented on CASSANDRA-3141:
---------------------------------------------
To be precise, this patch doesn't work correctly: if you add a column and a column with the same name already exists, the serialized size won't be computed correctly.
Now we could say that this doesn't matter much, in the sense that
# If you use SSTSUW in cases where you update the same column a lot, you're probably doing it wrong.
# Even when that happens, the consequence is that you will 'flush to disk' more often than you would otherwise, which isn't necessarily a big deal.
# It is an estimate anyway.
That being said, I wonder if this call to serializedSize() is really that costly. Maybe it adds a little bit of cost when you 'reopen' a row multiple times, but you are not really supposed to do that much (if ever).
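The same-name problem raised above can be illustrated with a small sketch. Everything here is hypothetical: OverwriteSizeSketch is not Cassandra code, the size model is name length plus value length, and an overwrite is assumed to simply replace the previous value (timestamps and reconciliation are ignored). It shows a naive counter that only ever increments drifting above the true size, and a variant that subtracts the size of the replaced column staying exact.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a row whose columns are keyed by name. */
public class OverwriteSizeSketch {
    private final TreeMap<String, byte[]> columns = new TreeMap<>();
    long naiveCachedSize = 0; // incremented on every add, even on overwrite
    long exactCachedSize = 0; // corrected for the column being replaced

    // Assumed size model: name length plus value length (ASCII names).
    private static long size(String name, byte[] value) {
        return name.length() + value.length;
    }

    void addColumn(String name, byte[] value) {
        byte[] previous = columns.put(name, value); // assumption: last write wins
        naiveCachedSize += size(name, value);
        exactCachedSize += size(name, value);
        if (previous != null) {
            // Remove the contribution of the column that was just replaced.
            exactCachedSize -= size(name, previous);
        }
    }

    /** Ground truth: walk the columns that are actually stored. */
    long sizeByIteration() {
        long total = 0;
        for (Map.Entry<String, byte[]> e : columns.entrySet()) {
            total += size(e.getKey(), e.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        OverwriteSizeSketch cf = new OverwriteSizeSketch();
        cf.addColumn("a", new byte[10]);
        cf.addColumn("a", new byte[4]); // same name: replaces the first value
        cf.addColumn("b", new byte[6]);
        System.out.println("iteration = " + cf.sizeByIteration()); // 12
        System.out.println("exact     = " + cf.exactCachedSize);   // 12
        System.out.println("naive     = " + cf.naiveCachedSize);   // 23
    }
}
```

In this model the naive counter can only overestimate, which matches the point that the practical consequence would be flushing more often than necessary.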
[jira] [Resolved] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call
to ColumnFamily.serializedSize iterate through the whole columns
Posted by "Sylvain Lebresne (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne resolved CASSANDRA-3141.
-----------------------------------------
Resolution: Not A Problem
Fix Version/s: (was: 0.8.8)
OK, closing this for now. If someone has evidence that there is a real need for optimization here, they can reopen it.
[jira] [Commented] (CASSANDRA-3141) SSTableSimpleUnsortedWriter
call to ColumnFamily.serializedSize iterate through the whole columns
Posted by "Benoit Perroud (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100449#comment-13100449 ]
Benoit Perroud commented on CASSANDRA-3141:
-------------------------------------------
Iterating through a list of 1'000'000 elements obviously takes time.
But I agree with both of you:
- it's a premature optimization; I will try with CASSANDRA-2843 first
- the way I use SSTSUW is not entirely appropriate; I get much better results doing one key after the other.
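Why writing one key after the other helps can be shown with a toy cost model. This is only a sketch under stated assumptions: ReopenCostSketch is a hypothetical class, not SSTableSimpleUnsortedWriter, and it assumes that each time a row is closed (because another newRow call arrives or writing ends) the size of every column buffered for that row is recomputed by a full scan.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model counting column visits caused by size recomputation. */
public class ReopenCostSketch {
    private final Map<String, Integer> columnCounts = new HashMap<>();
    private String currentKey = null;
    long columnsVisited = 0;

    void newRow(String key) {
        closeCurrentRow();
        currentKey = key;
        columnCounts.putIfAbsent(key, 0); // reopening keeps earlier columns
    }

    void addColumn() {
        columnCounts.merge(currentKey, 1, Integer::sum);
    }

    // Assumed cost: closing a row scans every column buffered for it.
    void closeCurrentRow() {
        if (currentKey != null) {
            columnsVisited += columnCounts.get(currentKey);
            currentKey = null;
        }
    }

    /** Reopens the same row before every single column. */
    static long reopenPerColumn(int n) {
        ReopenCostSketch w = new ReopenCostSketch();
        for (int i = 0; i < n; i++) {
            w.newRow("k");
            w.addColumn();
        }
        w.closeCurrentRow();
        return w.columnsVisited;
    }

    /** Opens the row once and adds all columns before closing it. */
    static long oneKeyAtATime(int n) {
        ReopenCostSketch w = new ReopenCostSketch();
        w.newRow("k");
        for (int i = 0; i < n; i++) {
            w.addColumn();
        }
        w.closeCurrentRow();
        return w.columnsVisited;
    }

    public static void main(String[] args) {
        System.out.println(reopenPerColumn(1000)); // 500500 visits
        System.out.println(oneKeyAtATime(1000));   // 1000 visits
    }
}
```

Under this model, reopening a row for each of n columns costs n(n+1)/2 column visits, while grouping all columns of a key costs n.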
[jira] [Updated] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call
to ColumnFamily.serializedSize iterate through the whole columns
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-3141:
--------------------------------------
Reviewer: slebresne
Fix Version/s: 0.8.6
[jira] [Commented] (CASSANDRA-3141) SSTableSimpleUnsortedWriter
call to ColumnFamily.serializedSize iterate through the whole columns
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098998#comment-13098998 ]
Jonathan Ellis commented on CASSANDRA-3141:
-------------------------------------------
It does feel a little like premature optimization to me, in the absence of profiler data showing this is a major expense. When I've looked at similar profiling before, the actual serialization was much more of a bottleneck than serializedSize.
[jira] [Updated] (CASSANDRA-3141) SSTableSimpleUnsortedWriter call
to ColumnFamily.serializedSize iterate through the whole columns
Posted by "Benoit Perroud (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Perroud updated CASSANDRA-3141:
--------------------------------------
Attachment: CachedSizeCF.patch
PoC for a CF that computes its serialized size when a new Column is added, instead of computing it every time CF.serializedSize is called