Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2023/01/18 07:19:57 UTC

[GitHub] [doris] SaintBacchus opened a new pull request, #16055: [feature](Load)Suppot skip specific lines number for csv stream load

SaintBacchus opened a new pull request, #16055:
URL: https://github.com/apache/doris/pull/16055

   # Proposed changes
   
   One subtask of [DSIP-028](https://cwiki.apache.org/confluence/display/DORIS/DSIP-028%3A+Suppot+MySQL+Load+Data)
   
   ## Problem summary
   
   Support setting the number of leading lines to skip when loading a CSV file via stream load.
   
   Usage `-H skip_lines:number`:
   ```
   curl --location-trusted -u root: -T test.csv -H skip_lines:5  -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load
   ```
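
For illustration only, the effect of `skip_lines` on the uploaded file can be pictured with a minimal sketch. This is not the Doris reader; `read_rows_skipping` is a hypothetical helper that simply drops the first `skip_lines` lines of a stream before any rows are parsed.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Doris): returns every line of `in`
// except the first `skip_lines` lines.
std::vector<std::string> read_rows_skipping(std::istream& in, int skip_lines) {
    assert(skip_lines >= 0);
    std::vector<std::string> rows;
    std::string line;
    int seen = 0;
    while (std::getline(in, line)) {
        // Discard lines until `skip_lines` of them have been consumed.
        if (seen++ < skip_lines) {
            continue;
        }
        rows.push_back(line);
    }
    return rows;
}
```

With `skip_lines` set to 5, as in the curl command above, loading would start at the sixth line of `test.csv`.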
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [ ] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [ ] Yes
       - [ ] No
       - [ ] No Need
   3. Has document been added or modified:
       - [ ] Yes
       - [ ] No
       - [ ] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [ ] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [ ] No
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1409787272

   PR approved by anyone and no changes requested.


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1407588267

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] morningman merged pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman merged PR #16055:
URL: https://github.com/apache/doris/pull/16055


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1407669523

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1411340298

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1409787203

   PR approved by at least one committer and no changes requested.


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1409640340

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1411547843

   PR approved by at least one committer and no changes requested.


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1386605062

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1407600963

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1404769900

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] SaintBacchus commented on a diff in pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by "SaintBacchus (via GitHub)" <gi...@apache.org>.
SaintBacchus commented on code in PR #16055:
URL: https://github.com/apache/doris/pull/16055#discussion_r1091350517


##########
be/src/vec/exec/format/csv/csv_reader.cpp:
##########
@@ -88,14 +88,18 @@ CsvReader::~CsvReader() = default;
 Status CsvReader::init_reader(bool is_load) {
     // set the skip lines and start offset
     int64_t start_offset = _range.start_offset;
-    if (start_offset == 0 && _params.__isset.file_attributes &&
-        _params.file_attributes.__isset.header_type &&
-        _params.file_attributes.header_type.size() > 0) {
-        std::string header_type = to_lower(_params.file_attributes.header_type);
-        if (header_type == BeConsts::CSV_WITH_NAMES) {
-            _skip_lines = 1;
-        } else if (header_type == BeConsts::CSV_WITH_NAMES_AND_TYPES) {
-            _skip_lines = 2;
+    if (start_offset == 0) {
+        // check header typer first
+        if (_params.__isset.file_attributes && _params.file_attributes.__isset.header_type &&
+            _params.file_attributes.header_type.size() > 0) {
+            std::string header_type = to_lower(_params.file_attributes.header_type);
+            if (header_type == BeConsts::CSV_WITH_NAMES) {
+                _skip_lines = 1;
+            } else if (header_type == BeConsts::CSV_WITH_NAMES_AND_TYPES) {
+                _skip_lines = 2;
+            }
+        } else if (_params.file_attributes.__isset.skip_lines) {

Review Comment:
   There is no need to check `_params.__isset.file_attributes` in my PR, since it will always be set in the FE code.
   But for compatibility, I think it's better to check every optional field in the thrift struct.
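
The precedence discussed in this thread can be sketched as follows. This is a simplified model, not the code in `csv_reader.cpp`: `FileAttributes` and `resolve_skip_lines` are hypothetical names, `std::optional` stands in for thrift's `__isset` flags, the string constants are assumed to match the documented format names, and the body of the `skip_lines` branch (not shown in the diff) is assumed to copy the user-supplied value.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the thrift-generated attributes struct.
// std::optional plays the role of thrift's per-field `__isset` flags.
struct FileAttributes {
    std::optional<std::string> header_type;
    std::optional<int32_t> skip_lines;
};

// Lower-cases an ASCII string, similar in spirit to to_lower() in the diff.
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns how many leading lines to skip for a scan range.
// `attrs` is empty when the whole file_attributes struct is unset.
int32_t resolve_skip_lines(int64_t start_offset, const std::optional<FileAttributes>& attrs) {
    assert(start_offset >= 0);
    // Only a range starting at offset 0 skips lines; the has_value() test
    // is the defensive check on the optional struct discussed above.
    if (start_offset != 0 || !attrs.has_value()) {
        return 0;
    }
    // The header type is checked first, as in the diff.
    if (attrs->header_type.has_value() && !attrs->header_type->empty()) {
        const std::string header_type = to_lower_ascii(*attrs->header_type);
        if (header_type == "csv_with_names") return 1;
        if (header_type == "csv_with_names_and_types") return 2;
        return 0;  // header_type set but unrecognized: skip_lines is not consulted
    }
    // Assumption: the branch body elided from the diff copies the user value.
    if (attrs->skip_lines.has_value()) {
        return *attrs->skip_lines;
    }
    return 0;
}
```

Under these assumptions, a request with only `skip_lines` set to 5 yields 5, while one that also sets a `csv_with_names` header type yields 1, which matches the documented rule that `skip_lines` has no effect for the header formats.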



[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1407255276

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1404540626

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1386600272

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] morningman commented on a diff in pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman commented on code in PR #16055:
URL: https://github.com/apache/doris/pull/16055#discussion_r1090766395


##########
docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md:
##########
@@ -180,6 +180,8 @@ ERRORS:
 
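 25. trim_double_quotes: Boolean type, the default value is false. When true, the outermost double quotes of each field in the csv file are trimmed.
 
+26. skip_lines: <version since="1.2" type="inline"> Integer type, the default value is 0. It skips the first few lines of the csv file. This parameter has no effect when format is set to csv_with_names or csv_with_names_and_types. </version>

Review Comment:
   ```suggestion
   26. skip_lines: <version since="dev" type="inline"> Integer type, the default value is 0. It skips the first few lines of the csv file. This parameter has no effect when format is set to `csv_with_names` or `csv_with_names_and_types`. </version>
   ```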



##########
be/src/vec/exec/format/csv/csv_reader.cpp:
##########
@@ -88,14 +88,18 @@ CsvReader::~CsvReader() = default;
 Status CsvReader::init_reader(bool is_load) {
     // set the skip lines and start offset
     int64_t start_offset = _range.start_offset;
-    if (start_offset == 0 && _params.__isset.file_attributes &&
-        _params.file_attributes.__isset.header_type &&
-        _params.file_attributes.header_type.size() > 0) {
-        std::string header_type = to_lower(_params.file_attributes.header_type);
-        if (header_type == BeConsts::CSV_WITH_NAMES) {
-            _skip_lines = 1;
-        } else if (header_type == BeConsts::CSV_WITH_NAMES_AND_TYPES) {
-            _skip_lines = 2;
+    if (start_offset == 0) {
+        // check header typer first
+        if (_params.__isset.file_attributes && _params.file_attributes.__isset.header_type &&
+            _params.file_attributes.header_type.size() > 0) {
+            std::string header_type = to_lower(_params.file_attributes.header_type);
+            if (header_type == BeConsts::CSV_WITH_NAMES) {
+                _skip_lines = 1;
+            } else if (header_type == BeConsts::CSV_WITH_NAMES_AND_TYPES) {
+                _skip_lines = 2;
+            }
+        } else if (_params.file_attributes.__isset.skip_lines) {

Review Comment:
   Do we need to check `_params.__isset.file_attributes` here?



##########
docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md:
##########
@@ -183,6 +183,8 @@ ERRORS:
 
 25. trim_double_quotes: Boolean type, The default value is false. True means that the outermost double quotes of each field in the csv file are trimmed.
 
+26. skip_lines: <version since="1.2" type="inline"> Integer type, the default value is 0. It will skip some lines in the head of csv file. It will be disable when format is csv_with_names or csv_with_names_and_types. </version>

Review Comment:
   ```suggestion
   26. skip_lines: <version since="dev" type="inline"> Integer type, the default value is 0. It will skip some lines in the head of csv file. It will be disabled when format is `csv_with_names` or `csv_with_names_and_types`. </version>
   ```



##########
fe/fe-core/src/main/cup/sql_parser.cup:
##########
@@ -621,7 +621,8 @@ terminal String
     KW_AUTO,
     KW_PREPARE,
     KW_EXECUTE,
-    KW_LINES;
+    KW_LINES,
+    KW_IGNORE;

Review Comment:
   `KW_IGNORE` also needs to be added to the `keywords ::=` entry.



[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1409741011

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1386634579

   clang-tidy review says "All clean, LGTM! :+1:"


[GitHub] [doris] hello-stephen commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by GitBox <gi...@apache.org>.
hello-stephen commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1386922085

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 36.05 seconds
    load time: 553 seconds
    storage size: 17122659918 Bytes
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230118114439_clickbench_pr_83113.html


[GitHub] [doris] github-actions[bot] commented on pull request #16055: [feature](Load)Suppot skip specific lines number for csv stream load

Posted by github-actions.
github-actions[bot] commented on PR #16055:
URL: https://github.com/apache/doris/pull/16055#issuecomment-1399489165

   clang-tidy review says "All clean, LGTM! :+1:"

