You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@cassandra.apache.org by "Samarth Gahire (Created) (JIRA)" <ji...@apache.org> on 2012/02/22 07:18:48 UTC

[jira] [Created] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
------------------------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-3943
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
             Project: Cassandra
          Issue Type: Improvement
          Components: Hadoop, Tools
    Affects Versions: 0.8.2, 1.1.0
            Reporter: Samarth Gahire
            Assignee: Brandon Williams
            Priority: Minor
             Fix For: 1.1.0


When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
But After loading , sstables created in the cluster nodes are of size around
{code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}

As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3943:
--------------------------------------

    Affects Version/s: 1.1.0
    
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Assignee: Stu Hood
>            Priority: Minor
>              Labels: bulkloader, hadoop, ponies, sstableloader, streaming, tools
>             Fix For: 1.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214808#comment-13214808 ] 

Brandon Williams commented on CASSANDRA-3943:
---------------------------------------------

bq. As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases

This is unavoidable, since the sstable now contains more ranges that belong to specific replica ranges in the ring.

bq. Such small size sstables take too much time to compact (minor compaction)

Assuming SizeTieredStrategy, increasing the maximum threshold may help this to some degree, so that the nodes compact more tiny sstables at a time.

bq. Is there any solution to this in existing versions or are you fixing this in future version?

I'm open to ideas, but have no plans as there is no clear solution.
                
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Assignee: Brandon Williams
>            Priority: Minor
>              Labels: bulkloader, hadoop, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Brandon Williams (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-3943:
----------------------------------------

    Reviewer:   (was: jbellis)
    Assignee:     (was: Brandon Williams)
      Labels: bulkloader hadoop ponies sstableloader streaming tools  (was: bulkloader hadoop sstableloader streaming tools)
    
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Priority: Minor
>              Labels: bulkloader, hadoop, ponies, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235437#comment-13235437 ] 

Peter Schuller commented on CASSANDRA-3943:
-------------------------------------------

We are working on generating single-range-per-reducer sstables so that there is no overlap, and each reducer can send to a single node (or at least one node per sstable generated). It doesn't address local storage, but does address this.

It also has the effect that if we combine it with log(n) filtering of sstables in the read path based on ranges, it would be feasable to bulk import and have thousands of sstables and completely disable compaction.

                
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Priority: Minor
>              Labels: bulkloader, hadoop, ponies, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Stu Hood (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-3943:
--------------------------------

    Assignee:     (was: Stu Hood)

I talked to Peter after my initial comment, and he's right: we'd actually prefer to dump in many small sstables _if-and-only-if_ they can be correctly incorporated into the interval tree such that they don't trigger a ton of compaction. This would allow the bulk insert to warm up incrementally as it is loaded, rather than all at once, as it would with one large sstable.

But unfortunately, work commitments mean I won't be able to start investigating this anytime soon, so I'm un-assigning it.
                
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Priority: Minor
>              Labels: bulkloader, hadoop, sstableloader, streaming, tools
>             Fix For: 1.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Brandon Williams (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-3943:
----------------------------------------

    Fix Version/s:     (was: 1.1.0)
       Issue Type: Task  (was: Improvement)
    
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Task
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Assignee: Brandon Williams
>            Priority: Minor
>              Labels: bulkloader, hadoop, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3943:
--------------------------------------

    Affects Version/s:     (was: 1.1.0)
        Fix Version/s: 1.2
             Assignee: Stu Hood
    
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Assignee: Stu Hood
>            Priority: Minor
>              Labels: bulkloader, hadoop, ponies, sstableloader, streaming, tools
>             Fix For: 1.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235440#comment-13235440 ] 

Peter Schuller commented on CASSANDRA-3943:
-------------------------------------------

It also facilitates replacing a data set one sstable at a time (if one generates sstables that correspond exactly in ranges), allowing completely replacement of a dataset without a temporary disk space spike.

Without any of these fixes, extra disk space needed is very significant - both regular compaction overhead in addition to loading two data sets onto the node.
                
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Priority: Minor
>              Labels: bulkloader, hadoop, ponies, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Stu Hood (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235427#comment-13235427 ] 

Stu Hood commented on CASSANDRA-3943:
-------------------------------------

I'd like to work on this, as it seems like it should be possible to make a BulkOutputFormat that writes at most one sstable from each reducer to each host, if you sort the data by token before it arrives at the reducer. Essentially, the OutputFormat would assert that it was receiving the data in sorted order, and write it straight to the socket as an sstable data file: this has been the 'lifelong dream' of MapReduce integration.
                
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Priority: Minor
>              Labels: bulkloader, hadoop, ponies, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-3943) Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.

Posted by "Brandon Williams (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-3943:
----------------------------------------

    Issue Type: Wish  (was: Task)
    
> Too many small size sstables after loading data using sstableloader or BulkOutputFormat increases compaction time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3943
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3943
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Hadoop, Tools
>    Affects Versions: 0.8.2, 1.1.0
>            Reporter: Samarth Gahire
>            Assignee: Brandon Williams
>            Priority: Minor
>              Labels: bulkloader, hadoop, sstableloader, streaming, tools
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When we create sstables using SimpleUnsortedWriter or BulkOutputFormat,the size of sstables created is around the buffer size provided.
> But After loading , sstables created in the cluster nodes are of size around
> {code}( (sstable_size_before_loading) * replication_factor ) / No_Of_Nodes_In_Cluster{code}
> As the no of nodes in cluster goes increasing, size of each sstable loaded to cassandra node decreases.Such small size sstables take too much time to compact (minor compaction) as compare to relatively large size sstables.
> One solution that we have tried is to increase the buffer size while generating sstables.But as we increase the buffer size ,time taken to generate sstables increases.Is there any solution to this in existing versions or are you fixing this in future version?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira