You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/05/30 09:27:54 UTC

[GitHub] [incubator-doris] yinzhijian opened a new pull request, #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

yinzhijian opened a new pull request, #9856:
URL: https://github.com/apache/incubator-doris/pull/9856

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   Only support one level array now.
   for example:
   - nullable(array(nullable(tinyint))) is **support**.
   - nullable(array(nullable(array(xx))) is **not support**.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes)
   2. Has unit tests been added: (No)
   3. Has document been added or modified: (No Need)
   4. Does it need to update dependencies: (No)
   5. Are there any changes that cannot be rolled back: (No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman merged pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
morningman merged PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yinzhijian commented on a diff in pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
yinzhijian commented on code in PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#discussion_r890158272


##########
be/src/vec/functions/function_cast.h:
##########
@@ -1040,6 +1073,69 @@ class FunctionCast final : public IFunctionBase {
         };
     }
 
+    WrapperType create_array_wrapper(const DataTypePtr& from_type_untyped,
+                                     const DataTypeArray& to_type) const {
+        /// Conversion from String through parsing.
+        if (check_and_get_data_type<DataTypeString>(from_type_untyped.get())) {
+            return &ConvertImplGenericFromString<ColumnString>::execute;
+        }
+
+        const auto* from_type = check_and_get_data_type<DataTypeArray>(from_type_untyped.get());
+
+        if (!from_type) {
+            LOG(FATAL) << "CAST AS Array can only be performed between same-dimensional Array, "
+                          "String types";
+        }
+
+        DataTypePtr from_nested_type = from_type->get_nested_type();
+
+        /// In query SELECT CAST([] AS Array(Array(String))) from type is Array(Nothing)

Review Comment:
   handle nulls with existing cast logic, same as _string_ cast to _int_.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#issuecomment-1147475161

   PR approved by at least one committer and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] HappenLee commented on a diff in pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
HappenLee commented on code in PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#discussion_r889882792


##########
be/src/vec/utils/arrow_column_to_doris_column.cpp:
##########
@@ -236,19 +237,51 @@ static Status convert_column_with_decimal_data(const arrow::Array* array, size_t
     return Status::OK();
 }
 
+static Status convert_offset_from_list_column(const arrow::Array* array, size_t array_idx,
+                                              MutableColumnPtr& data_column, size_t num_elements,
+                                              size_t* start_idx_for_data, size_t* num_for_data) {
+    auto& offsets_data = static_cast<ColumnArray&>(*data_column).get_offsets();
+    auto concrete_array = down_cast<const arrow::ListArray*>(array);
+    auto arrow_offsets_array = concrete_array->offsets();
+    auto arrow_offsets = down_cast<arrow::Int32Array*>(arrow_offsets_array.get());
+    for (int64_t i = array_idx + 1; i < array_idx + num_elements + 1; ++i) {
+        offsets_data.emplace_back(arrow_offsets->Value(i));
+    }
+    *start_idx_for_data = arrow_offsets->Value(array_idx);
+    *num_for_data = offsets_data.back() - *start_idx_for_data;
+
+    return Status::OK();
+}
+
+static Status convert_column_with_list_data(const arrow::Array* array, size_t array_idx,
+                                            MutableColumnPtr& data_column, size_t num_elements,
+                                            const std::string& timezone) {
+    size_t start_idx_of_data = 0;
+    size_t num_of_data = 0;
+    // get start idx and num of values from arrow offsets
+    RETURN_IF_ERROR(convert_offset_from_list_column(array, array_idx, data_column, num_elements,
+                                                    &start_idx_of_data, &num_of_data));
+    auto& data_column_ptr = static_cast<ColumnArray&>(*data_column).get_data_ptr();
+    auto concrete_array = down_cast<const arrow::ListArray*>(array);
+    std::shared_ptr<arrow::Array> arrow_data = concrete_array->values();
+
+    return arrow_column_to_doris_column(arrow_data.get(), start_idx_of_data, data_column_ptr,
+                                        num_of_data, timezone);

Review Comment:
   do offset valid in this function,do not affects other processing logic which no need concern the arrow offset



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#issuecomment-1149536278

   PR approved by at least one committer and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#issuecomment-1147475207

   PR approved by anyone and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] HappenLee commented on a diff in pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
HappenLee commented on code in PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#discussion_r889806014


##########
be/src/vec/functions/function_cast.h:
##########
@@ -1040,6 +1073,69 @@ class FunctionCast final : public IFunctionBase {
         };
     }
 
+    WrapperType create_array_wrapper(const DataTypePtr& from_type_untyped,
+                                     const DataTypeArray& to_type) const {
+        /// Conversion from String through parsing.
+        if (check_and_get_data_type<DataTypeString>(from_type_untyped.get())) {
+            return &ConvertImplGenericFromString<ColumnString>::execute;
+        }
+
+        const auto* from_type = check_and_get_data_type<DataTypeArray>(from_type_untyped.get());
+
+        if (!from_type) {
+            LOG(FATAL) << "CAST AS Array can only be performed between same-dimensional Array, "
+                          "String types";
+        }
+
+        DataTypePtr from_nested_type = from_type->get_nested_type();
+
+        /// In query SELECT CAST([] AS Array(Array(String))) from type is Array(Nothing)

Review Comment:
   here may cast `[Array String]` to `[Array Int]`, but the `string` cast to `int` may lead to `NULL`, how to handle it?



##########
be/src/vec/utils/arrow_column_to_doris_column.cpp:
##########
@@ -236,19 +237,51 @@ static Status convert_column_with_decimal_data(const arrow::Array* array, size_t
     return Status::OK();
 }
 
+static Status convert_offset_from_list_column(const arrow::Array* array, size_t array_idx,
+                                              MutableColumnPtr& data_column, size_t num_elements,
+                                              size_t* start_idx_for_data, size_t* num_for_data) {
+    auto& offsets_data = static_cast<ColumnArray&>(*data_column).get_offsets();
+    auto concrete_array = down_cast<const arrow::ListArray*>(array);
+    auto arrow_offsets_array = concrete_array->offsets();
+    auto arrow_offsets = down_cast<arrow::Int32Array*>(arrow_offsets_array.get());
+    for (int64_t i = array_idx + 1; i < array_idx + num_elements + 1; ++i) {
+        offsets_data.emplace_back(arrow_offsets->Value(i));

Review Comment:
   the offset is offset of `arrow::ListArray`, seems not the correct offset of `data_column` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] adonis0147 commented on a diff in pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
adonis0147 commented on code in PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#discussion_r884627439


##########
be/src/vec/data_types/data_type_factory.hpp:
##########
@@ -49,22 +49,33 @@ class DataTypeFactory {
         static std::once_flag oc;
         static DataTypeFactory instance;
         std::call_once(oc, []() {
-            instance.register_data_type("UInt8", std::make_shared<DataTypeUInt8>());
-            instance.register_data_type("UInt16", std::make_shared<DataTypeUInt16>());
-            instance.register_data_type("UInt32", std::make_shared<DataTypeUInt32>());
-            instance.register_data_type("UInt64", std::make_shared<DataTypeUInt64>());
-            instance.register_data_type("Int8", std::make_shared<DataTypeInt8>());
-            instance.register_data_type("Int16", std::make_shared<DataTypeInt16>());
-            instance.register_data_type("Int32", std::make_shared<DataTypeInt32>());
-            instance.register_data_type("Int64", std::make_shared<DataTypeInt64>());
-            instance.register_data_type("Int128", std::make_shared<DataTypeInt128>());
-            instance.register_data_type("Float32", std::make_shared<DataTypeFloat32>());
-            instance.register_data_type("Float64", std::make_shared<DataTypeFloat64>());
-            instance.register_data_type("Date", std::make_shared<DataTypeDate>());
-            instance.register_data_type("DateTime", std::make_shared<DataTypeDateTime>());
-            instance.register_data_type("String", std::make_shared<DataTypeString>());
-            instance.register_data_type("Decimal",
-                                        std::make_shared<DataTypeDecimal<Decimal128>>(27, 9));
+            std::unordered_map<std::string, DataTypePtr> base_type_map {
+                    {"UInt8", std::make_shared<DataTypeUInt8>()},
+                    {"UInt16", std::make_shared<DataTypeUInt16>()},
+                    {"UInt32", std::make_shared<DataTypeUInt32>()},
+                    {"UInt64", std::make_shared<DataTypeUInt64>()},
+                    {"Int8", std::make_shared<DataTypeInt8>()},
+                    {"Int16", std::make_shared<DataTypeInt16>()},
+                    {"Int32", std::make_shared<DataTypeInt32>()},
+                    {"Int64", std::make_shared<DataTypeInt64>()},
+                    {"Int128", std::make_shared<DataTypeInt128>()},
+                    {"Float32", std::make_shared<DataTypeFloat32>()},
+                    {"Float64", std::make_shared<DataTypeFloat64>()},
+                    {"Date", std::make_shared<DataTypeDate>()},
+                    {"DateTime", std::make_shared<DataTypeDateTime>()},
+                    {"String", std::make_shared<DataTypeString>()},
+                    {"Decimal", std::make_shared<DataTypeDecimal<Decimal128>>(27, 9)},
+
+            };
+            for (auto const& [key, val] : base_type_map) {
+                instance.register_data_type(key, val);
+                instance.register_data_type("Array(" + key + ")",
+                                            std::make_shared<vectorized::DataTypeArray>(val));
+                instance.register_data_type(
+                        "Array(Nullable(" + key + "))",
+                        std::make_shared<vectorized::DataTypeArray>(
+                                std::make_shared<vectorized::DataTypeNullable>(val)));
+            }

Review Comment:
   This modification may be unnecessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yinzhijian commented on a diff in pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
yinzhijian commented on code in PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#discussion_r886622380


##########
be/src/vec/data_types/data_type_factory.hpp:
##########
@@ -49,22 +49,33 @@ class DataTypeFactory {
         static std::once_flag oc;
         static DataTypeFactory instance;
         std::call_once(oc, []() {
-            instance.register_data_type("UInt8", std::make_shared<DataTypeUInt8>());
-            instance.register_data_type("UInt16", std::make_shared<DataTypeUInt16>());
-            instance.register_data_type("UInt32", std::make_shared<DataTypeUInt32>());
-            instance.register_data_type("UInt64", std::make_shared<DataTypeUInt64>());
-            instance.register_data_type("Int8", std::make_shared<DataTypeInt8>());
-            instance.register_data_type("Int16", std::make_shared<DataTypeInt16>());
-            instance.register_data_type("Int32", std::make_shared<DataTypeInt32>());
-            instance.register_data_type("Int64", std::make_shared<DataTypeInt64>());
-            instance.register_data_type("Int128", std::make_shared<DataTypeInt128>());
-            instance.register_data_type("Float32", std::make_shared<DataTypeFloat32>());
-            instance.register_data_type("Float64", std::make_shared<DataTypeFloat64>());
-            instance.register_data_type("Date", std::make_shared<DataTypeDate>());
-            instance.register_data_type("DateTime", std::make_shared<DataTypeDateTime>());
-            instance.register_data_type("String", std::make_shared<DataTypeString>());
-            instance.register_data_type("Decimal",
-                                        std::make_shared<DataTypeDecimal<Decimal128>>(27, 9));
+            std::unordered_map<std::string, DataTypePtr> base_type_map {
+                    {"UInt8", std::make_shared<DataTypeUInt8>()},
+                    {"UInt16", std::make_shared<DataTypeUInt16>()},
+                    {"UInt32", std::make_shared<DataTypeUInt32>()},
+                    {"UInt64", std::make_shared<DataTypeUInt64>()},
+                    {"Int8", std::make_shared<DataTypeInt8>()},
+                    {"Int16", std::make_shared<DataTypeInt16>()},
+                    {"Int32", std::make_shared<DataTypeInt32>()},
+                    {"Int64", std::make_shared<DataTypeInt64>()},
+                    {"Int128", std::make_shared<DataTypeInt128>()},
+                    {"Float32", std::make_shared<DataTypeFloat32>()},
+                    {"Float64", std::make_shared<DataTypeFloat64>()},
+                    {"Date", std::make_shared<DataTypeDate>()},
+                    {"DateTime", std::make_shared<DataTypeDateTime>()},
+                    {"String", std::make_shared<DataTypeString>()},
+                    {"Decimal", std::make_shared<DataTypeDecimal<Decimal128>>(27, 9)},
+
+            };
+            for (auto const& [key, val] : base_type_map) {
+                instance.register_data_type(key, val);
+                instance.register_data_type("Array(" + key + ")",
+                                            std::make_shared<vectorized::DataTypeArray>(val));
+                instance.register_data_type(
+                        "Array(Nullable(" + key + "))",
+                        std::make_shared<vectorized::DataTypeArray>(
+                                std::make_shared<vectorized::DataTypeNullable>(val)));
+            }

Review Comment:
   1. This map is used to convert string structures such as: (Array(Nullable(Varchar))) to a DataType structure, which is not the same as create_data_type
   
   2. Finding and returning pre-created structures from the map is currently the most efficient way to do String <=> DataType



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] adonis0147 commented on a diff in pull request #9856: [feature-wip](array-type) Add array type support for vectorized parquet-orc scanner

Posted by GitBox <gi...@apache.org>.
adonis0147 commented on code in PR #9856:
URL: https://github.com/apache/incubator-doris/pull/9856#discussion_r884627439


##########
be/src/vec/data_types/data_type_factory.hpp:
##########
@@ -49,22 +49,33 @@ class DataTypeFactory {
         static std::once_flag oc;
         static DataTypeFactory instance;
         std::call_once(oc, []() {
-            instance.register_data_type("UInt8", std::make_shared<DataTypeUInt8>());
-            instance.register_data_type("UInt16", std::make_shared<DataTypeUInt16>());
-            instance.register_data_type("UInt32", std::make_shared<DataTypeUInt32>());
-            instance.register_data_type("UInt64", std::make_shared<DataTypeUInt64>());
-            instance.register_data_type("Int8", std::make_shared<DataTypeInt8>());
-            instance.register_data_type("Int16", std::make_shared<DataTypeInt16>());
-            instance.register_data_type("Int32", std::make_shared<DataTypeInt32>());
-            instance.register_data_type("Int64", std::make_shared<DataTypeInt64>());
-            instance.register_data_type("Int128", std::make_shared<DataTypeInt128>());
-            instance.register_data_type("Float32", std::make_shared<DataTypeFloat32>());
-            instance.register_data_type("Float64", std::make_shared<DataTypeFloat64>());
-            instance.register_data_type("Date", std::make_shared<DataTypeDate>());
-            instance.register_data_type("DateTime", std::make_shared<DataTypeDateTime>());
-            instance.register_data_type("String", std::make_shared<DataTypeString>());
-            instance.register_data_type("Decimal",
-                                        std::make_shared<DataTypeDecimal<Decimal128>>(27, 9));
+            std::unordered_map<std::string, DataTypePtr> base_type_map {
+                    {"UInt8", std::make_shared<DataTypeUInt8>()},
+                    {"UInt16", std::make_shared<DataTypeUInt16>()},
+                    {"UInt32", std::make_shared<DataTypeUInt32>()},
+                    {"UInt64", std::make_shared<DataTypeUInt64>()},
+                    {"Int8", std::make_shared<DataTypeInt8>()},
+                    {"Int16", std::make_shared<DataTypeInt16>()},
+                    {"Int32", std::make_shared<DataTypeInt32>()},
+                    {"Int64", std::make_shared<DataTypeInt64>()},
+                    {"Int128", std::make_shared<DataTypeInt128>()},
+                    {"Float32", std::make_shared<DataTypeFloat32>()},
+                    {"Float64", std::make_shared<DataTypeFloat64>()},
+                    {"Date", std::make_shared<DataTypeDate>()},
+                    {"DateTime", std::make_shared<DataTypeDateTime>()},
+                    {"String", std::make_shared<DataTypeString>()},
+                    {"Decimal", std::make_shared<DataTypeDecimal<Decimal128>>(27, 9)},
+
+            };
+            for (auto const& [key, val] : base_type_map) {
+                instance.register_data_type(key, val);
+                instance.register_data_type("Array(" + key + ")",
+                                            std::make_shared<vectorized::DataTypeArray>(val));
+                instance.register_data_type(
+                        "Array(Nullable(" + key + "))",
+                        std::make_shared<vectorized::DataTypeArray>(
+                                std::make_shared<vectorized::DataTypeNullable>(val)));
+            }

Review Comment:
   This modification may be unnecessary. Refer to `DataTypeFactory::create_data_type`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org