Posted to commits@cassandra.apache.org by "Jacob Perkins (JIRA)" <ji...@apache.org> on 2010/11/12 18:38:18 UTC

[jira] Created: (CASSANDRA-1737) Simplify bulk loading using the bmt_example

Simplify bulk loading using the bmt_example
-------------------------------------------

                 Key: CASSANDRA-1737
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
             Project: Cassandra
          Issue Type: Improvement
          Components: Contrib
    Affects Versions: 0.7 beta 2
            Reporter: Jacob Perkins
            Priority: Minor
             Fix For: 0.7.0


Current bmt_example does not work as given with 0.7. Make it work and possibly easier to use. Also, it should not require a reduce, especially to insert a flat table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (CASSANDRA-1737) Simplify bulk loading using the bmt_example

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1737.
---------------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.7.1)

Thanks for the patch, Jacob.

I'm closing this for now since I'm pretty sure CASSANDRA-1278 is a better way forward than BMT.  Will reopen if that turns out to be a dead end.

> Simplify bulk loading using the bmt_example
> -------------------------------------------
>
>                 Key: CASSANDRA-1737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Contrib
>    Affects Versions: 0.7 beta 2
>            Reporter: Jacob Perkins
>            Priority: Minor
>         Attachments: cassandra_bulk_loader.tar.bz2
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Current bmt_example does not work as given with 0.7. Make it work and possibly easier to use. Also, it should not require a reduce, especially to insert a flat table.


[jira] Updated: (CASSANDRA-1737) Simplify bulk loading using the bmt_example

Posted by "Jacob Perkins (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Perkins updated CASSANDRA-1737:
-------------------------------------

    Attachment: cassandra_bulk_loader.tar.bz2

Contains a solution with a readme. Not sure if this is the appropriate way to submit this.

> Simplify bulk loading using the bmt_example
> -------------------------------------------
>
>                 Key: CASSANDRA-1737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Contrib
>    Affects Versions: 0.7 beta 2
>            Reporter: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.7.0
>
>         Attachments: cassandra_bulk_loader.tar.bz2
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Current bmt_example does not work as given with 0.7. Make it work and possibly easier to use. Also, it should not require a reduce, especially to insert a flat table.


[jira] Updated: (CASSANDRA-1737) Simplify bulk loading using the bmt_example

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1737:
--------------------------------------

    Fix Version/s:     (was: 0.7.0)
                   0.7.1

> Simplify bulk loading using the bmt_example
> -------------------------------------------
>
>                 Key: CASSANDRA-1737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Contrib
>    Affects Versions: 0.7 beta 2
>            Reporter: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.7.1
>
>         Attachments: cassandra_bulk_loader.tar.bz2
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Current bmt_example does not work as given with 0.7. Make it work and possibly easier to use. Also, it should not require a reduce, especially to insert a flat table.


[jira] Commented: (CASSANDRA-1737) Simplify bulk loading using the bmt_example

Posted by "Jacob Perkins (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932094#action_12932094 ] 

Jacob Perkins commented on CASSANDRA-1737:
------------------------------------------

I wouldn't argue that it is better. Instead, I've examined the use cases that we at Infochimps have (lots of data in different shapes and sizes) and rewritten the example to make a couple of generic use cases as painless as possible. Those are:

a. Inserting a flat table with column names (a huge number of datasets look like this)
b. Inserting records where the column names are the fields themselves (helps with 'graph shaped' datasets)

In my experience, if you're already using Hadoop, rearranging your data to fit one of the two generic structures is more or less trivial. If your data is more complex, or is unable to fully express itself in one of these two structures, then you'll be forced to write custom code (as you would already have had to do).
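
To make the two shapes concrete, here is a minimal sketch of how one tab-separated record might be turned into columns in each case. It is not taken from the attached cassandra_bulk_loader code: the class name, the method names, the sample data, and the choice to treat the first field as the row key in case (b) are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: none of these names come from the attached loader.
class RowShapes {

    // Shape (a): a flat table. The caller supplies the column names and each
    // field becomes the value of the column at the same position.
    static Map<String, String> flatTable(String[] columnNames, String[] fields) {
        if (columnNames.length != fields.length) {
            throw new IllegalArgumentException(
                "expected " + columnNames.length + " fields, got " + fields.length);
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            columns.put(columnNames[i], fields[i]);
        }
        return columns;
    }

    // Shape (b): the fields themselves are the column names, with empty values.
    // Assumption: field 0 is the row key, so it is skipped here.
    static Map<String, String> fieldsAsColumnNames(String[] fields) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            columns.put(fields[i], "");
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up sample records, one per shape.
        String[] names = {"user_id", "name", "city"};
        String[] record = "42\tAda\tLondon".split("\t");
        System.out.println(flatTable(names, record));

        String[] edges = "alice\tbob\tcarol".split("\t");
        System.out.println(edges[0] + " -> " + fieldsAsColumnNames(edges).keySet());
    }
}
```

In the second shape a row such as "alice, bob, carol" would become a row keyed "alice" with columns named "bob" and "carol", which is one way an adjacency list could be stored.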

As far as why this is different:

0. In general, data rearrangement and other preprocessing should be decoupled from the database loading itself.

1. This does not require a reduce step. That means, if your data is already arranged as it needs to be for insertion (a reasonable requirement, I think), you can skip the costly overhead of a partition, copy, and sort on the Hadoop side of things. Fewer moving parts, fewer things to fail.

2. Implements the Hadoop tool runner, allowing you to pass in generic '-D' options. These include the path to cassandra.yaml, the type of insertion, the row key field, the super column name field (if any), and the column names, as well as Hadoop options such as the min split size.

3. Uses code from AbstractCassandraDaemon.java to initialize the internal node.

4. Two types of bulk loading are supported.

5. Simple Ruby runner for a clean interface.
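
As a rough illustration of points 1 and 2, the sketch below collects '-D key=value' pairs from a command line into a map. It is a stand-in written for this note, not Hadoop's own option parsing and not the attached loader. The 'example.*' property names are hypothetical; 'mapred.reduce.tasks' and 'mapred.min.split.size' are, to my understanding, the Hadoop property names of that era for the reducer count and the minimum split size, with a reducer count of 0 giving a map-only job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in only: this is not how Hadoop itself parses options.
class GenericOptionsSketch {

    // Collects "-D key=value" and "-Dkey=value" pairs; other arguments are ignored.
    static Map<String, String> parse(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String pair = null;
            if (args[i].equals("-D") && i + 1 < args.length) {
                pair = args[++i];             // "-D key=value" form
            } else if (args[i].startsWith("-D") && args[i].length() > 2) {
                pair = args[i].substring(2);  // "-Dkey=value" form
            }
            if (pair == null) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                props.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return props;
    }

    public static void main(String[] args) {
        String[] sample = {
            "-D", "mapred.reduce.tasks=0",                          // no reduce step
            "-D", "mapred.min.split.size=134217728",                // Hadoop-side tuning
            "-D", "example.cassandra.config=/path/cassandra.yaml",  // hypothetical name
            "-Dexample.rowkey.field=0",                             // hypothetical name
            "input_dir"
        };
        parse(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Keeping every setting behind a '-D' option means the same loader can be pointed at a different dataset without recompiling, which fits the goal of decoupling preprocessing from loading.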


I'll submit the changes as a patch as soon as I am able.



[jira] Commented: (CASSANDRA-1737) Simplify bulk loading using the bmt_example

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931861#action_12931861 ] 

Jonathan Ellis commented on CASSANDRA-1737:
-------------------------------------------

Can you explain more how this is different / why this is better than the existing bmt_example?

