Posted to notifications@iotdb.apache.org by "Xiangdong Huang (Jira)" <ji...@apache.org> on 2021/08/12 08:42:00 UTC

[jira] [Commented] (IOTDB-842) Better Import-CSV Tool

    [ https://issues.apache.org/jira/browse/IOTDB-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397931#comment-17397931 ] 

Xiangdong Huang commented on IOTDB-842:
---------------------------------------

New User Interface Design:


# CSV Tool

The CSV tool can help you import data in CSV format to IoTDB or export data from IoTDB to a CSV file.

## Usage of export-csv.sh

### Syntax

```shell
# Unix/OS X
> tools/export-csv.sh  -h <ip> -p <port> -u <username> -pw <password> -td <directory> [-tf <time-format> -datatype <true/false> -q <query command> -s <sql file>]

# Windows
> tools\export-csv.bat -h <ip> -p <port> -u <username> -pw <password> -td <directory> [-tf <time-format> -datatype <true/false> -q <query command> -s <sql file>]
```

Description:

* `-datatype`:
    - true (by default): print the data type of each timeseries in the header line of the CSV file, e.g., `Time, root.sg1.d1.s1(int), root.sg1.d1.s2(long)`.
    - false: only print the timeseries name in the header line of the CSV file, e.g., `Time, root.sg1.d1.s1, root.sg1.d1.s2`
* `-q <query command>`:
    - the query statement to execute
    - example: `select * from root limit 100`, or `select * from root limit 100 align by device`
* `-s <sql file>`:
    - a SQL file that may contain more than one SQL statement. If the file contains multiple statements, they should be separated by line breaks, and one output CSV file is generated per statement.
* `-td <directory>`:
    - the directory to which the data will be exported
* `-tf <time-format>`:
    - the time format to use, which must follow the [ISO 8601](https://calendars.wikia.org/wiki/ISO_8601) standard. To save the time as a raw timestamp, set `-tf timestamp`
    - example: `-tf yyyy-MM-dd\ HH:mm:ss` or `-tf timestamp`

In addition, if neither `-s` nor `-q` is used, you will be prompted to enter queries after the export script starts. The results of different queries are saved to different CSV files.

### Example

```shell
# Unix/OS X
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf yyyy-MM-dd\ HH:mm:ss
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -q "select * from root"
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -s sql.txt
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf yyyy-MM-dd\ HH:mm:ss -s sql.txt

# Windows
> tools\export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./
# Or
> tools\export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf yyyy-MM-dd\ HH:mm:ss
# Or
> tools\export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -q "select * from root"
# Or
> tools\export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -s sql.txt
# Or
> tools\export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf yyyy-MM-dd\ HH:mm:ss -s sql.txt
```

### Sample SQL file

```sql
select * from root;
select * from root align by device;
```

The result of `select * from root`:

```csv
Time,root.ln.wf04.wt04.status (BOOLEAN),root.ln.wf03.wt03.hardware (TEXT),root.ln.wf02.wt02.status (BOOLEAN),root.ln.wf02.wt02.hardware (TEXT),root.ln.wf01.wt01.hardware (TEXT),root.ln.wf01.wt01.status (BOOLEAN)
1970-01-01T08:00:00.001+08:00,true,"v1",true,"v1",v1,true
1970-01-01T08:00:00.002+08:00,true,"v1",,,,true
```

The result of `select * from root align by device`:

```csv
Time,Device,hardware (TEXT),status (BOOLEAN)
1970-01-01T08:00:00.001+08:00,root.ln.wf01.wt01,"v1",true
1970-01-01T08:00:00.002+08:00,root.ln.wf01.wt01,,true
1970-01-01T08:00:00.001+08:00,root.ln.wf02.wt02,"v1",true
1970-01-01T08:00:00.001+08:00,root.ln.wf03.wt03,"v1",
1970-01-01T08:00:00.002+08:00,root.ln.wf03.wt03,"v1",
1970-01-01T08:00:00.001+08:00,root.ln.wf04.wt04,,true
1970-01-01T08:00:00.002+08:00,root.ln.wf04.wt04,,true
```

Boolean data is written as `true` or `false` without double quotes, while text data is enclosed in double quotes.

### Note

Note that the export tool handles fields containing the following special characters as follows:

1. `,`: the field will be enclosed in `"`.
2. `"`: the field will be enclosed in `"`, and the original `"` characters in the field will be escaped as `\"`.
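These export-side escaping rules can be sketched as a small Python helper (a hypothetical illustration of the rules above, not the tool's actual code):

```python
def escape_field(field: str) -> str:
    """Illustrative sketch of the export escaping rules: escape embedded
    double quotes as \\" and enclose the field in double quotes if it
    contains a comma or a double quote. Not part of IoTDB itself."""
    needs_quotes = "," in field or '"' in field
    field = field.replace('"', '\\"')
    return f'"{field}"' if needs_quotes else field
```

For example, `escape_field('say "hi"')` yields `"say \"hi\""`, while a plain field such as `v1` passes through unchanged.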

## Usage of import-csv.sh

### Create metadata (optional)

```sql
SET STORAGE GROUP TO root.fit.d1;
SET STORAGE GROUP TO root.fit.d2;
SET STORAGE GROUP TO root.fit.p;
CREATE TIMESERIES root.fit.d1.s1 WITH DATATYPE=INT32,ENCODING=RLE;
CREATE TIMESERIES root.fit.d1.s2 WITH DATATYPE=TEXT,ENCODING=PLAIN;
CREATE TIMESERIES root.fit.d2.s1 WITH DATATYPE=INT32,ENCODING=RLE;
CREATE TIMESERIES root.fit.d2.s3 WITH DATATYPE=INT32,ENCODING=RLE;
CREATE TIMESERIES root.fit.p.s1 WITH DATATYPE=INT32,ENCODING=RLE;
```

IoTDB supports type inference, so it is not strictly necessary to create metadata before importing data. However, we still recommend creating metadata before using the CSV import tool, as this avoids unnecessary type conversion errors.

### Sample CSV file to be imported

```csv
Time,root.fit.d1.s1,root.fit.d1.s2,root.fit.d2.s1,root.fit.d2.s3,root.fit.p.s1
1,100,"hello",200,300,400
2,500,"world",600,700,800
3,900,"hello, \"world\"",1000,1100,1200
```

### Syntax

```shell
# Unix/OS X
> tools/import-csv.sh -h <ip> -p <port> -u <username> -pw <password> -f <xxx.csv> [-fd <yyy.csv> -policy <1/2> -table-aligned <time/device>]

# Windows
> tools\import-csv.bat -h <ip> -p <port> -u <username> -pw <password> -f <xxx.csv> [-fd <yyy.csv> -policy <1/2> -table-aligned <time/device>]
```

Description:

* `-f`:
  - the CSV file to import
  - example: `-f filename.csv`
* `-fd`:
  - a file in which to save failed lines; by default, `export0.csv.failed`
  - example: `-fd failed-filename.csv`
* `-policy`:
  - 1 (by default): use this policy if the CSV file defines the data type of each timeseries in its header line, e.g., `Time, root.sg1.d1.s1(int), root.sg1.d1.s2(long)`.
  - 2: make the program perform type inference
* `-table-aligned`:
  - `time` (by default): the data to import is shaped like the result of `select * from root`
  - `device`: the data to import is shaped like the result of `select * from root align by device`

### Example

```shell
# Unix/OS X
> tools/import-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -f example-filename.csv -fd failed-filename.csv
# Or
> tools/import-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -f example-filename.csv -fd failed-filename.csv -policy 2
# Or
> tools/import-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -f example-filename.csv -fd failed-filename.csv -policy 2 -table-aligned device

# Windows
> tools\import-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -f example-filename.csv
# Or
> tools\import-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -f example-filename.csv -fd failed-filename.csv -policy 2
# Or
> tools\import-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -f example-filename.csv -fd failed-filename.csv -policy 2 -table-aligned device
```

### Note

Note that the following special characters in fields need to be checked before importing:

1. `,`: fields containing `,` should be enclosed in a pair of `"` or a pair of `'`.
2. `"`: `"` in fields should be escaped as `\"`, or the field should be enclosed in `'`.
3. `'`: `'` in fields should be escaped as `\'`, or the field should be enclosed in `"`.
4. Time formats such as `yyyy-MM-dd'T'HH:mm:ss`, `yyyy-MM-dd HH:mm:ss`, or `yyyy-MM-dd'T'HH:mm:ss.SSSZ` are supported.
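A minimal pre-import check following rules 1-3 above might look like this in Python (a hypothetical sketch, not the import tool's actual code; it wraps problematic fields in double quotes, which also covers embedded single quotes):

```python
def quote_field(field: str) -> str:
    """Sanitize a CSV field before import: if it contains a comma, a
    double quote, or a single quote, escape embedded double quotes as
    \\" and enclose the field in double quotes (an embedded single
    quote may stand as-is inside a double-quoted field).
    Hypothetical helper, not part of the import tool."""
    if any(c in field for c in ",\"'"):
        return '"' + field.replace('"', '\\"') + '"'
    return field
```

For example, `quote_field("it's")` yields `"it's"`, and `quote_field('x"y')` yields `"x\"y"`.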


> Better Import-CSV Tool
> ----------------------
>
>                 Key: IOTDB-842
>                 URL: https://issues.apache.org/jira/browse/IOTDB-842
>             Project: Apache IoTDB
>          Issue Type: Task
>          Components: Tools/Others
>            Reporter: Xiangdong Huang
>            Priority: Minor
>
> Hi, our import-csv tool is currently implemented via JDBC and requires a rigid format:
> e.g., 
> {code:java}
> Time,root.sg.d1.s1,root.sg.d1.s2,root.sg.d2.s1,root.sg.d2.s2,root.sg.d2.s3
> 2020-08-18T10:22:31.603+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:35.631+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:41.093+08:00,null,null,1,2.0,null
> 2020-08-18T10:22:52.603+08:00,null,null,1,2.0,true
> {code}
> Requirement 1:
> As we support 3 kinds of output format: align all series (by default), align by device, without alignment, it is better to support such 3 kinds of import-csv format:
> a. 
> {code:java}
> Time,root.sg.d1.s1,root.sg.d1.s2,root.sg.d2.s1,root.sg.d2.s2,root.sg.d2.s3
> 2020-08-18T10:22:31.603+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:35.631+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:41.093+08:00,null,null,1,2.0,null
> 2020-08-18T10:22:52.603+08:00,null,null,1,2.0,true
> {code}
> b. 
> {code:java}
> Time,Device,s1,s2,s3
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,2.0,null
> 2020-08-18T10:22:35.631+08:00,root.sg.d1,1,2.0,null
> 2020-08-18T10:22:41.093+08:00,root.sg.d2,1,2.0,null
> 2020-08-18T10:22:52.603+08:00,root.sg.d2,1,2.0,true
> {code}
> c.
> (this format is strange; I'd prefer not to support it.)
> Requirement 2:
> Different users may have different time formats for the first column.
> So, we'd better support different kinds of time format, e.g., let users define how to parse their timestamps: yyyy-MM-dd HH:mm:ss.SSS, etc.
> Requirement 3:
> Support NULL as well as the empty string to describe a null data point. For example, the following 3 lines are equivalent:
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,null,null
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,,
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,    ,
> Requirement 4:
> Support declaring the storage group name once rather than repeating it on every line:
> e.g., for format b, we can tell the tool the sg is `root.sg`, and then each row looks like:
> 2020-08-18T10:22:35.631+08:00,d1,1,2.0,null
> Another option is to add a new column called storage_group to each row.
> For UT:
> 1. all data types should be covered;
> 2. incorrect CSV formats should be covered;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)