Posted to commits@cassandra.apache.org by "Jacob Perkins (JIRA)" <ji...@apache.org> on 2010/11/12 18:38:18 UTC
[jira] Created: (CASSANDRA-1737) Simplify bulk loading using the bmt_example
Simplify bulk loading using the bmt_example
-------------------------------------------
Key: CASSANDRA-1737
URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
Project: Cassandra
Issue Type: Improvement
Components: Contrib
Affects Versions: 0.7 beta 2
Reporter: Jacob Perkins
Priority: Minor
Fix For: 0.7.0
Current bmt_example does not work as given with 0.7. Make it work and possibly easier to use. Also, it should not require a reduce, especially to insert a flat table.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (CASSANDRA-1737) Simplify bulk loading using the bmt_example
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis resolved CASSANDRA-1737.
---------------------------------------
Resolution: Won't Fix
Fix Version/s: (was: 0.7.1)
Thanks for the patch, Jacob.
I'm closing this for now since I'm pretty sure CASSANDRA-1278 is a better way forward than BMT. Will reopen if that turns out to be a dead end.
> Simplify bulk loading using the bmt_example
> -------------------------------------------
>
> Key: CASSANDRA-1737
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
> Project: Cassandra
> Issue Type: Improvement
> Components: Contrib
> Affects Versions: 0.7 beta 2
> Reporter: Jacob Perkins
> Priority: Minor
> Attachments: cassandra_bulk_loader.tar.bz2
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Current bmt_example does not work as given with 0.7. Make it work and possibly easier to use. Also, it should not require a reduce, especially to insert a flat table.
[jira] Updated: (CASSANDRA-1737) Simplify bulk loading using the bmt_example
Posted by "Jacob Perkins (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacob Perkins updated CASSANDRA-1737:
-------------------------------------
Attachment: cassandra_bulk_loader.tar.bz2
Contains a solution with readme. Not sure if this is the appropriate way to submit this.
[jira] Updated: (CASSANDRA-1737) Simplify bulk loading using the bmt_example
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-1737:
--------------------------------------
Fix Version/s: 0.7.1 (was: 0.7.0)
[jira] Commented: (CASSANDRA-1737) Simplify bulk loading using the bmt_example
Posted by "Jacob Perkins (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932094#action_12932094 ]
Jacob Perkins commented on CASSANDRA-1737:
------------------------------------------
I wouldn't argue that it is better. Instead, I've examined the use cases that we at Infochimps have (lots of data in different shapes and sizes) and rewrote the example to make a couple of generic use cases as painless as possible. Those are:
a. Inserting a flat table with column names (a huge number of datasets look like this)
b. Inserting records where the column names are the fields themselves (helps with 'graph shaped' datasets)
In my experience, if you're already using Hadoop, rearranging your data to fit one of the two generic structures is more or less trivial. If your data is more complex, or is unable to fully express itself in one of these two structures, then you'll be forced to write custom code (as you would already have had to do).
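As a rough illustration of the two generic structures (a) and (b), here is a minimal plain-Java sketch. It is not taken from the attached loader: the class RowShapes, its method names, the choice to leave the row key out of the columns, and the empty values used in case (b) are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of cassandra_bulk_loader: models the two row shapes.
final class RowShapes {

    // Case (a): a flat table. Each field is stored under a configured column name.
    // Assumption: the field at keyIndex is the row key and is not repeated as a column.
    static Map<String, String> flatTable(String[] columnNames, String[] fields, int keyIndex) {
        if (columnNames.length != fields.length) {
            throw new IllegalArgumentException("Expected " + columnNames.length
                    + " fields but got " + fields.length);
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            if (i != keyIndex) {
                columns.put(columnNames[i], fields[i]);
            }
        }
        return columns;
    }

    // Case (b): the fields themselves become the column names.
    // Assumption: the column values are left empty, as in an adjacency list.
    static Map<String, String> fieldsAsNames(String[] fields, int keyIndex) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            if (i != keyIndex) {
                columns.put(fields[i], "");
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // A tab-separated record for a flat table: id, name, city.
        String[] user = "u42\tAda\tLondon".split("\t");
        System.out.println(flatTable(new String[] {"id", "name", "city"}, user, 0));

        // A tab-separated record for a graph: a node followed by its neighbours.
        String[] edges = "u42\tu7\tu19".split("\t");
        System.out.println(fieldsAsNames(edges, 0));
    }
}
```

In both cases the first field is treated as the row key; a real loader would take the key position and the column names from configuration rather than hard-coding them.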
As far as why this is different:
0. In general, data rearrangement and other preprocessing should be decoupled from the database loading itself.
1. This does not require a reduce step. That means that if your data is already arranged as it needs to be for insertion (a reasonable requirement, I think), you can skip the costly overhead of a partition, copy, and sort on the Hadoop side of things. Fewer moving parts, fewer things to fail.
2. Implements the Hadoop tool runner, allowing you to pass in generic '-D' options. These include the path to cassandra.yaml, the type of insertion, the row key field, the super column name field (if any), and the column names, as well as Hadoop options such as the min split size.
3. Uses code from AbstractCassandraDaemon.java to initialize the internal node.
4. Two types of bulk loading are supported.
5. Simple Ruby runner for a clean interface.
I'll submit the changes as a patch as soon as I am able.
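To make the '-D' option handling in point 2 concrete, here is a small plain-Java stand-in for what Hadoop's ToolRunner and GenericOptionsParser do before a job's run method is called. It is a sketch only: the class GenericOptionsSketch and every property name in it (sketch.cassandra.config, sketch.rowkey.field, sketch.reduce.tasks) are hypothetical and are not the option names used by the attached loader. In an actual Hadoop job, the map-only behaviour from point 1 is normally requested by calling setNumReduceTasks(0) on the job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Hadoop's generic option parsing; no Hadoop classes are used.
final class GenericOptionsSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);          // not a -D option: leave it for the job itself
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value after -D, got: " + pair);
            }
            properties.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
    }

    public static void main(String[] args) {
        // All property names below are made up for this sketch.
        String[] argv = {
            "-D", "sketch.cassandra.config=/etc/cassandra/cassandra.yaml",
            "-Dsketch.rowkey.field=0",
            "-Dsketch.reduce.tasks=0",
            "/data/input"
        };
        GenericOptionsSketch parsed = new GenericOptionsSketch(argv);
        System.out.println("properties: " + parsed.properties);
        System.out.println("remaining:  " + parsed.remaining);
    }
}
```

Keeping the loader's settings in key=value properties like this is what lets a thin wrapper script assemble the command line without the Java code needing to change.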
[jira] Commented: (CASSANDRA-1737) Simplify bulk loading using the bmt_example
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931861#action_12931861 ]
Jonathan Ellis commented on CASSANDRA-1737:
-------------------------------------------
Can you explain more how this is different / why this is better than the existing bmt_example?