Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/07/21 10:40:41 UTC

[GitHub] [incubator-doris] worker24h opened a new pull request #4136: Fixbug: json load

worker24h opened a new pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136


   1) Fix bug: loading a single JSON object fails in routine load.
       Example JSON data:  {"category":"11","title":"SayingsoftheCentury","price":895,"timestamp":1589191587}
   
   2) Add `json_root` for nested JSON data.
      Example JSON data:
           {
           "RECORDS":[
               {"category":"11","title":"SayingsoftheCentury","price":895,"timestamp":1589191587},
               {"category":"22","author":"2avc","price":895,"timestamp":1589191487},
               {"category":"33","author":"3avc","title":"SayingsoftheCentury","timestamp":1589191387}
               ]
           }
   Routine load:
           CREATE ROUTINE LOAD example_db.test1 ON example_tbl
           COLUMNS(category, author, price, timestamp, dt=from_unixtime(timestamp, '%Y%m%d'))
           PROPERTIES
           (
               "desired_concurrent_number"="3",
               "max_batch_interval" = "20",
               "max_batch_rows" = "300000",
               "max_batch_size" = "209715200",
               "strict_mode" = "false",
               "format" = "json",
               "jsonpaths" = "[\"$.category\",\"$.author\",\"$.price\",\"$.timestamp\"]",
               "strip_outer_array" = "true",
               "json_root" = "$.RECORDS"
           )
           FROM KAFKA
           (
               "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
               "kafka_topic" = "my_topic",
               "kafka_partitions" = "0,1,2",
               "kafka_offsets" = "0,0,0"
           );
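
   As a rough illustration of how a `json_root` value such as `$.RECORDS` names the node that holds the records, the sketch below splits a simple dotted path into its steps. `split_json_root` is a hypothetical helper, not Doris's `JsonFunctions::parse_json_paths`, and it assumes only plain `.`-separated keys (no array indexes or quoting).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split a simple dotted path such as "$.RECORDS"
// into its steps. Handles only plain '.'-separated keys.
std::vector<std::string> split_json_root(const std::string& path) {
    std::vector<std::string> steps;
    std::string::size_type start = 0;
    while (start <= path.size()) {
        std::string::size_type dot = path.find('.', start);
        if (dot == std::string::npos) {
            dot = path.size();
        }
        if (dot > start) { // skip empty steps
            steps.push_back(path.substr(start, dot - start));
        }
        start = dot + 1;
    }
    return steps;
}
```

   For `"$.RECORDS"` this yields the steps `$` and `RECORDS`; with `strip_outer_array` set, each element of the array found at that node would then be treated as one row.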
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] worker24h commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
worker24h commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r462255743



##########
File path: be/src/exec/json_scanner.h
##########
@@ -143,10 +140,12 @@ class JsonReader {
     bool _strip_outer_array;
     RuntimeProfile::Counter* _bytes_read_counter;
     RuntimeProfile::Counter* _read_timer;
+
     std::vector<std::vector<JsonPath>> _parsed_jsonpaths;
-    rapidjson::Document _json_doc;
-    //key: column name
-    std::unordered_map<std::string, JsonDataInternal> _jmap;
+    std::vector<JsonPath> _parsed_json_root;
+
+    rapidjson::Document _orinal_json_doc; // orinal json document object from parsed json string

Review comment:
       ok

##########
File path: be/src/http/http_common.h
##########
@@ -37,6 +37,7 @@ static const std::string HTTP_STRICT_MODE = "strict_mode";
 static const std::string HTTP_TIMEZONE = "timezone";
 static const std::string HTTP_EXEC_MEM_LIMIT = "exec_mem_limit";
 static const std::string HTTP_EXEC_JSONPATHS  = "jsonpaths";
+static const std::string HTTP_EXEC_JSONROOT  = "json_root";

Review comment:
       ok

##########
File path: fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadJob.java
##########
@@ -17,6 +17,14 @@
 
 package org.apache.doris.load.routineload;
 
+import com.google.common.base.Joiner;

Review comment:
       ok






[GitHub] [incubator-doris] imay commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
imay commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r462260814



##########
File path: docs/en/sql-reference/sql-statements/Data Manipulation/ROUTINE LOAD.md
##########
@@ -178,6 +178,9 @@ FROM data_source
     8. `strip_outer_array`
         Boolean type, true to indicate that json data starts with an array object and flattens objects in the array object, default value is false.
 
+    9. `json_root`
+        User specifies the JSON root node as the valid JSONPath string.

Review comment:
       What is the default value of `json_root`? Please document it so others know.

##########
File path: be/src/exec/json_scanner.cpp
##########
@@ -197,30 +201,50 @@ JsonReader::~JsonReader() {
     _close();
 }
 
-Status JsonReader::init() {
+Status JsonReader::init(const std::string& jsonpath, const std::string& json_root) {
     // parse jsonpath
+    if (!jsonpath.empty()) {
+        Status st = _generate_json_paths(jsonpath, _parsed_jsonpaths);
+        RETURN_IF_ERROR(st);
+    }
+    if (!json_root.empty()) {
+        JsonFunctions::parse_json_paths(json_root, &_parsed_json_root);
+    }
+
+    //improve performance
+    if (_parsed_jsonpaths.empty()) { // input is a simple json-string
+        _handle_json_callback = &JsonReader::_handle_simple_json;
+    } else { // input is a complex json-string and a json-path
+        if (_strip_outer_array) {
+            _handle_json_callback = &JsonReader::_handle_flat_array_complex_json;
+        } else {
+            _handle_json_callback = &JsonReader::_handle_nested_complex_json;
+        }
+    }
+    return Status::OK();
+}
+
+Status JsonReader::_generate_json_paths(const std::string& jsonpath, std::vector<std::vector<JsonPath>>& vect) {

Review comment:
       ```suggestion
   Status JsonReader::_generate_json_paths(const std::string& jsonpath, std::vector<std::vector<JsonPath>>* vect) {
   ```
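
The suggestion above passes the output vector by pointer rather than by non-const reference, presumably following the common convention that makes the mutation visible at the call site (`&paths`). A minimal sketch of that convention, using a hypothetical `split_paths` function with `bool` and plain strings standing in for Doris's `Status` and `JsonPath` types:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in: input by const reference, output by pointer.
// Returns false if the output pointer is null.
bool split_paths(const std::string& csv, std::vector<std::string>* out) {
    if (out == nullptr) {
        return false;
    }
    out->clear();
    std::string current;
    for (char c : csv) {
        if (c == ',') {
            out->push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    out->push_back(current);
    return true;
}
```

A caller writes `split_paths(input, &paths)`, so a reader of the call can tell which argument is written to.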






[GitHub] [incubator-doris] worker24h commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
worker24h commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r458559252



##########
File path: be/src/exec/json_scanner.h
##########
@@ -144,9 +141,9 @@ class JsonReader {
     RuntimeProfile::Counter* _bytes_read_counter;
     RuntimeProfile::Counter* _read_timer;
     std::vector<std::vector<JsonPath>> _parsed_jsonpaths;
+    std::vector<std::vector<JsonPath>> _parsed_json_root;
     rapidjson::Document _json_doc;
-    //key: column name
-    std::unordered_map<std::string, JsonDataInternal> _jmap;
+    rapidjson::Value *_json_root;

Review comment:
       ok






[GitHub] [incubator-doris] worker24h commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
worker24h commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r458558969



##########
File path: be/src/exec/json_scanner.h
##########
@@ -106,33 +105,31 @@ struct JsonPath;
 class JsonReader {
 public:
     JsonReader(RuntimeState* state, ScannerCounter* counter, RuntimeProfile* profile, FileReader* file_reader,
-            std::string& jsonpath, bool strip_outer_array);
+            bool strip_outer_array);
     ~JsonReader();
 
-    Status init(); // must call before use
+    Status init(std::string& jsonpath, std::string& json_root); // must call before use

Review comment:
       ok






[GitHub] [incubator-doris] morningman commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
morningman commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r458546950



##########
File path: be/src/exec/json_scanner.h
##########
@@ -106,33 +105,31 @@ struct JsonPath;
 class JsonReader {
 public:
     JsonReader(RuntimeState* state, ScannerCounter* counter, RuntimeProfile* profile, FileReader* file_reader,
-            std::string& jsonpath, bool strip_outer_array);
+            bool strip_outer_array);
     ~JsonReader();
 
-    Status init(); // must call before use
+    Status init(std::string& jsonpath, std::string& json_root); // must call before use
 
     Status read(Tuple* tuple, const std::vector<SlotDescriptor*>& slot_descs, MemPool* tuple_pool, bool* eof);
 
 private:
+    Status (JsonReader::*_handle_json_callback)(Tuple* tuple, const std::vector<SlotDescriptor*>& slot_descs, MemPool* tuple_pool, bool* eof);
     Status _handle_simple_json(Tuple* tuple, const std::vector<SlotDescriptor*>& slot_descs, MemPool* tuple_pool, bool* eof);
-    Status _handle_complex_json(Tuple* tuple, const std::vector<SlotDescriptor*>& slot_descs, MemPool* tuple_pool, bool* eof);
     Status _handle_flat_array_complex_json(Tuple* tuple, const std::vector<SlotDescriptor*>& slot_descs, MemPool* tuple_pool, bool* eof);
     Status _handle_nested_complex_json(Tuple* tuple, const std::vector<SlotDescriptor*>& slot_descs, MemPool* tuple_pool, bool* eof);
 
     void _fill_slot(Tuple* tuple, SlotDescriptor* slot_desc, MemPool* mem_pool, const uint8_t* value, int32_t len);
-    void _assemble_jmap(const std::vector<SlotDescriptor*>& slot_descs);
+    int _assemble_jmap(const std::vector<SlotDescriptor*>& slot_descs);

Review comment:
       Remove this method
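
The `_handle_json_callback` member declared in the hunk above is a pointer to member function, which lets the reader pick a handler once instead of re-checking its options for every row. A self-contained sketch of the pattern; `Reader` and its handlers are hypothetical stand-ins that return a label instead of filling a tuple:

```cpp
#include <cassert>
#include <string>

class Reader {
public:
    Reader(bool has_jsonpaths, bool strip_outer_array) {
        // Select the handler once, up front.
        if (!has_jsonpaths) {
            _handler = &Reader::_handle_simple;
        } else if (strip_outer_array) {
            _handler = &Reader::_handle_flat_array;
        } else {
            _handler = &Reader::_handle_nested;
        }
    }

    // Dispatch through the stored pointer to member function.
    std::string read() { return (this->*_handler)(); }

private:
    std::string (Reader::*_handler)();
    std::string _handle_simple() { return "simple"; }
    std::string _handle_flat_array() { return "flat_array"; }
    std::string _handle_nested() { return "nested"; }
};
```

The `(this->*_handler)()` syntax is required; a pointer to member function cannot be called like a plain function pointer.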

##########
File path: be/src/exec/json_scanner.cpp
##########
@@ -197,30 +201,52 @@ JsonReader::~JsonReader() {
     _close();
 }
 
-Status JsonReader::init() {
+Status JsonReader::init(std::string& jsonpath, std::string& json_root) {
     // parse jsonpath
+    if (!jsonpath.empty()) {
+        Status st = _generater_json_paths(jsonpath, _parsed_jsonpaths);

Review comment:
       ```suggestion
           Status st = _generate_json_paths(jsonpath, _parsed_jsonpaths);
   ```

##########
File path: be/src/exec/json_scanner.cpp
##########
@@ -197,30 +201,52 @@ JsonReader::~JsonReader() {
     _close();
 }
 
-Status JsonReader::init() {
+Status JsonReader::init(std::string& jsonpath, std::string& json_root) {
     // parse jsonpath
+    if (!jsonpath.empty()) {
+        Status st = _generater_json_paths(jsonpath, _parsed_jsonpaths);
+        RETURN_IF_ERROR(st);
+    }
+    if (!json_root.empty()) {
+        std::vector<JsonPath> parsed_paths;
+        JsonFunctions::parse_json_paths(json_root, &parsed_paths);
+        _parsed_json_root.push_back(parsed_paths);

Review comment:
       We only support one JSON root, so we can check it here and return an error if more than one is given.

##########
File path: be/src/exec/json_scanner.h
##########
@@ -106,33 +105,31 @@ struct JsonPath;
 class JsonReader {
 public:
     JsonReader(RuntimeState* state, ScannerCounter* counter, RuntimeProfile* profile, FileReader* file_reader,
-            std::string& jsonpath, bool strip_outer_array);
+            bool strip_outer_array);
     ~JsonReader();
 
-    Status init(); // must call before use
+    Status init(std::string& jsonpath, std::string& json_root); // must call before use

Review comment:
       Can these parameters be changed to const &?
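
One practical effect of taking `const std::string&` instead of `std::string&` is that callers can pass temporaries and string literals, and the function cannot modify its arguments. A small sketch with a hypothetical `init_like` function (not the actual `JsonReader::init`):

```cpp
#include <cassert>
#include <string>

// Hypothetical function: const references accept lvalues, temporaries,
// and string literals, and cannot modify the caller's strings.
bool init_like(const std::string& jsonpath, const std::string& json_root) {
    return !jsonpath.empty() || !json_root.empty();
}
```

With non-const reference parameters, a call such as `init_like("", root)` would not compile, because a temporary cannot bind to `std::string&`.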

##########
File path: be/src/exec/json_scanner.cpp
##########
@@ -257,26 +283,38 @@ Status JsonReader::_parse_json_doc(bool* eof) {
         delete[] json_str;
         return Status::DataQualityError(str_error.str());
     }
+    delete[] json_str;
 
-    if (_json_doc.IsArray() && !_strip_outer_array) {
-        delete[] json_str;
+    // set json root
+    if (_parsed_json_root.size() != 0) {

Review comment:
       Why is `_parsed_json_root` a vector?

##########
File path: be/src/exec/json_scanner.h
##########
@@ -144,9 +141,9 @@ class JsonReader {
     RuntimeProfile::Counter* _bytes_read_counter;
     RuntimeProfile::Counter* _read_timer;
     std::vector<std::vector<JsonPath>> _parsed_jsonpaths;
+    std::vector<std::vector<JsonPath>> _parsed_json_root;
     rapidjson::Document _json_doc;
-    //key: column name
-    std::unordered_map<std::string, JsonDataInternal> _jmap;
+    rapidjson::Value *_json_root;

Review comment:
       Change the name of `_json_root`. It is easily confused with `_parsed_json_root`.
   My suggestion: `_final_json_doc`. Also add a comment explaining that it is generated from `_json_doc`.






[GitHub] [incubator-doris] morningman merged pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
morningman merged pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136


   




[GitHub] [incubator-doris] morningman commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
morningman commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r462211888



##########
File path: be/src/exec/json_scanner.h
##########
@@ -143,10 +140,12 @@ class JsonReader {
     bool _strip_outer_array;
     RuntimeProfile::Counter* _bytes_read_counter;
     RuntimeProfile::Counter* _read_timer;
+
     std::vector<std::vector<JsonPath>> _parsed_jsonpaths;
-    rapidjson::Document _json_doc;
-    //key: column name
-    std::unordered_map<std::string, JsonDataInternal> _jmap;
+    std::vector<JsonPath> _parsed_json_root;
+
+    rapidjson::Document _orinal_json_doc; // orinal json document object from parsed json string

Review comment:
       ```suggestion
       rapidjson::Document _origin_json_doc; // orinal json document object from parsed json string
   ```

##########
File path: be/src/http/http_common.h
##########
@@ -37,6 +37,7 @@ static const std::string HTTP_STRICT_MODE = "strict_mode";
 static const std::string HTTP_TIMEZONE = "timezone";
 static const std::string HTTP_EXEC_MEM_LIMIT = "exec_mem_limit";
 static const std::string HTTP_EXEC_JSONPATHS  = "jsonpaths";
+static const std::string HTTP_EXEC_JSONROOT  = "json_root";

Review comment:
       ```suggestion
   static const std::string HTTP_JSONROOT  = "json_root";
   ```
   
   EXEC is not a prefix; it belongs to `EXEC_MEM_LIMIT`.
   Also rename `HTTP_EXEC_JSONPATHS` to `HTTP_JSONPATHS`.

##########
File path: fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadJob.java
##########
@@ -17,6 +17,14 @@
 
 package org.apache.doris.load.routineload;
 
+import com.google.common.base.Joiner;

Review comment:
       Please fix the import order.






[GitHub] [incubator-doris] worker24h commented on a change in pull request #4136: Fixbug: json load

Posted by GitBox <gi...@apache.org>.
worker24h commented on a change in pull request #4136:
URL: https://github.com/apache/incubator-doris/pull/4136#discussion_r458558896



##########
File path: be/src/exec/json_scanner.cpp
##########
@@ -197,30 +201,52 @@ JsonReader::~JsonReader() {
     _close();
 }
 
-Status JsonReader::init() {
+Status JsonReader::init(std::string& jsonpath, std::string& json_root) {
     // parse jsonpath
+    if (!jsonpath.empty()) {
+        Status st = _generater_json_paths(jsonpath, _parsed_jsonpaths);
+        RETURN_IF_ERROR(st);
+    }
+    if (!json_root.empty()) {
+        std::vector<JsonPath> parsed_paths;
+        JsonFunctions::parse_json_paths(json_root, &parsed_paths);
+        _parsed_json_root.push_back(parsed_paths);

Review comment:
       ok



