Posted to commits@cassandra.apache.org by "Jeremy Hanna (JIRA)" <ji...@apache.org> on 2015/05/01 20:14:07 UTC

[jira] [Comment Edited] (CASSANDRA-9048) Delimited File Bulk Loader

    [ https://issues.apache.org/jira/browse/CASSANDRA-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523551#comment-14523551 ] 

Jeremy Hanna edited comment on CASSANDRA-9048 at 5/1/15 6:14 PM:
-----------------------------------------------------------------

The loader tool fulfills two purposes.

- One is to serve as a black-box tool with a lot of [options|https://github.com/brianmhess/cassandra-loader?files=1#options].  Options such as rate limiting, number of threads, boolean/date styles, number of retries, and whether the delimiter can be embedded in quotes can be added to COPY FROM over time.  Decent logging of the process, to know what didn't succeed and why, would also be helpful.

- The second purpose is to give people some example code to do asynchronous loading on their own.

I imagine the loader will continue to live on and evolve separately, but it would be nice to merge a lot of its options into COPY FROM.
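
As a rough illustration of the asynchronous loading mentioned above, the sketch below keeps a bounded number of inserts pending across a thread pool and retries each failed row a fixed number of times. It is not code from cassandra-loader: load_rows, insert_row, and the parameter names are hypothetical, and insert_row stands in for whatever driver call actually writes a row. Rate limiting is left out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def load_rows(rows, insert_row, num_threads=4, max_in_flight=32, num_retries=3):
    """Insert rows concurrently; return (loaded_count, failures).

    insert_row is a hypothetical callable that writes one row and raises
    on failure. A real loader would wrap a driver call here.
    """
    slots = threading.BoundedSemaphore(max_in_flight)  # caps queued + running rows
    lock = threading.Lock()
    failures = []  # (row, last error message) for rows that never succeeded
    loaded = 0

    def attempt(row):
        nonlocal loaded
        try:
            last_error = None
            for _ in range(1 + num_retries):  # first try plus the retries
                try:
                    insert_row(row)
                except Exception as exc:
                    last_error = exc
                else:
                    with lock:
                        loaded += 1
                    return
            with lock:
                failures.append((row, str(last_error)))
        finally:
            slots.release()

    # Leaving the "with" block waits for every submitted row to finish.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for row in rows:
            slots.acquire()  # blocks the reader while too much work is pending
            pool.submit(attempt, row)
    return loaded, failures
```

The failures list is what the "what didn't succeed and why" logging would be built from; the semaphore keeps a fast file reader from queueing the whole input in memory.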


was (Author: jeromatron):
The loader tool fulfills two purposes.

- One is a black box with a lot of [options|https://github.com/brianmhess/cassandra-loader?files=1#options].  These options can be added to COPY FROM over time such as rate limiting, number of threads, boolean/date styles, number of retries, whether the delimiter can be embedded in quotes, and so forth.  Decent logging of the process to know what didn't succeed and why would also be helpful.  It would be nice to have this built-in and the best tool for the job would be great.

- The second purpose is to give people some example code to do asynchronous loading on their own.

This will continue to live on and evolve separately I imagine, but it would be nice to merge in a lot of the options.

> Delimited File Bulk Loader
> --------------------------
>
>                 Key: CASSANDRA-9048
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9048
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter:  Brian Hess
>         Attachments: CASSANDRA-9048.patch
>
>
> There is a strong need for bulk loading data from delimited files into Cassandra.  Starting with delimited files means that the data is not yet in the SSTable format, and therefore cannot directly leverage Cassandra's bulk loading tool, sstableloader.
> Source data is in a delimited format far more often than in the SSTable format itself, so a tool that loads from delimited files is very useful.
> In order for this bulk loader to be more generally useful to customers, it should handle a number of options at a minimum:
> - support specifying the input file or reading the data from stdin (so other command-line programs can pipe into the loader)
> - supply the CQL schema for the input data
> - support all data types other than collections (collections are a stretch goal/need)
> - an option to specify the delimiter
> - an option to specify comma as the decimal delimiter (for international use cases)
> - an option to specify how NULL values are specified in the file (e.g., the empty string or the string NULL)
> - an option to specify how BOOLEAN values are specified in the file (e.g., TRUE/FALSE or 0/1)
> - an option to specify the Date and Time format
> - an option to skip some number of rows at the beginning of the file
> - an option to only read in some number of rows from the file
> - an option to indicate how many parse errors to tolerate
> - an option to specify a file that will contain all the lines that did not parse correctly (up to the maximum number of parse errors)
> - an option to specify the CQL port to connect to (with 9042 as the default).
> Additional options would be useful, but this set of options/features is a start.
> A word on COPY.  COPY comes via CQLSH, which requires the client to be the same version as the server (e.g., 2.0 CQLSH does not work with 2.1 Cassandra).  This tool should be able to connect to any version of Cassandra (within reason).  For example, it should be able to handle 2.0.x and 2.1.x.  Moreover, CQLSH's COPY command does not support a number of the options above.  Lastly, the performance of COPY in 2.0.x is not high enough to be considered a bulk ingest tool.
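
To make the parsing options in the description more concrete, here is a minimal sketch of how several of them (delimiter, decimal comma, NULL and BOOLEAN representations, date format, rows to skip, rows to read, tolerated parse errors) might fit together. It uses only the Python standard library and does not talk to Cassandra. ParseOptions, convert, parse, and the type names "int", "decimal", "boolean", and "timestamp" are all assumptions made for illustration, not part of the attached patch or of cassandra-loader.

```python
import csv
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass
class ParseOptions:  # hypothetical option names, loosely following the list above
    delimiter: str = ","
    null_string: str = ""
    bool_true: str = "TRUE"
    bool_false: str = "FALSE"
    decimal_comma: bool = False
    date_format: str = "%Y-%m-%d"
    skip_rows: int = 0
    max_rows: int = -1  # -1 means no limit
    max_errors: int = 10


def convert(text, kind, opts):
    """Turn one field into a Python value according to its column type."""
    if text == opts.null_string:
        return None
    if kind == "int":
        return int(text)
    if kind == "decimal":
        return Decimal(text.replace(",", ".") if opts.decimal_comma else text)
    if kind == "boolean":
        if text == opts.bool_true:
            return True
        if text == opts.bool_false:
            return False
        raise ValueError("not a boolean: %r" % text)
    if kind == "timestamp":
        return datetime.strptime(text, opts.date_format)
    return text  # any other type is kept as a string in this sketch


def parse(lines, column_types, opts):
    """Return (rows, bad), where bad holds (line number, fields, error)."""
    rows, bad, read = [], [], 0
    reader = csv.reader(lines, delimiter=opts.delimiter)
    for index, fields in enumerate(reader):
        if index < opts.skip_rows:
            continue
        if opts.max_rows >= 0 and read >= opts.max_rows:
            break
        read += 1
        try:
            if len(fields) != len(column_types):
                raise ValueError("expected %d fields" % len(column_types))
            rows.append([convert(f, k, opts) for f, k in zip(fields, column_types)])
        except (ValueError, ArithmeticError) as exc:  # Decimal errors are ArithmeticError
            bad.append((reader.line_num, fields, str(exc)))
            if len(bad) > opts.max_errors:
                raise RuntimeError("too many parse errors") from exc
    return rows, bad
```

A caller could write the bad list to the file of rejected lines described above, and pass sys.stdin as lines to read piped input.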


