Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/10/27 12:04:18 UTC

[GitHub] [doris] mmuuuua opened a new pull request, #13741: [feature](split_by_char)support split by char function

mmuuuua opened a new pull request, #13741:
URL: https://github.com/apache/doris/pull/13741

   # Proposed changes
   
   Issue Number: close #13738
   
   ## Problem summary
   
   Splits a string into substrings separated by a specified character. The separator is a constant string consisting of exactly one character. Returns an array of the resulting substrings. Empty substrings may be returned if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.
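
To illustrate the semantics described above, here is a minimal standalone sketch in plain C++. It is not the code from this PR: the free function `split_by_char` and its signature are hypothetical and only show how empty substrings arise from leading, trailing, and consecutive separators.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not the PR's implementation: splits `s` on every
// occurrence of the single character `sep` and keeps empty substrings.
std::vector<std::string> split_by_char(const std::string& s, char sep) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = s.find(sep, start);
        if (pos == std::string::npos) {
            // Last piece; empty when the string ends with the separator.
            parts.push_back(s.substr(start));
            break;
        }
        // Piece between the previous separator and this one (may be empty).
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

Under these assumptions, `split_by_char(",a,,b,", ',')` yields five elements: an empty string, `a`, an empty string, `b`, and a final empty string.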
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [x] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [x] Yes
       - [ ] No
       - [ ] No Need
   3. Has document been added or modified:
       - [x] Yes
       - [ ] No
       - [ ] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [x] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [x] No
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] Yukang-Lian commented on pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1341010147

   @zhangstar333 PTAL




[GitHub] [doris] hello-stephen commented on pull request #13741: [feature](split_by_char)support split by char function

Posted by GitBox <gi...@apache.org>.
hello-stephen commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1293761350

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 38.53 seconds
    load time: 564 seconds
    storage size: 17154644814 Bytes
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221028001035_clickbench_pr_34919.html




[GitHub] [doris] liqing-coder closed pull request #13741: [feature](split_by_char)support split by char function

Posted by GitBox <gi...@apache.org>.
liqing-coder closed pull request #13741: [feature](split_by_char)support split by char function
URL: https://github.com/apache/doris/pull/13741




[GitHub] [doris] dataroaring merged pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
dataroaring merged PR #13741:
URL: https://github.com/apache/doris/pull/13741




[GitHub] [doris] github-actions[bot] commented on pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1340304913

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] LOVEGISER commented on pull request #13741: [feature](split_by_char)support split by char function

Posted by GitBox <gi...@apache.org>.
LOVEGISER commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1296573353

   The documentation should be supplemented.




[GitHub] [doris] Yukang-Lian commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1025161329


##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1159,147 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+class FunctionSplitByString : public IFunction {
+
+/**
+ * explain              :Used to split the string to get the offset and length of each element in the string
+ * parameter s          :The string to be split
+ * parameter c          :delimiter(type of string)
+ * parameter v_offset   :A container used to store the offset of each element in a string
+ * parameter v_charlen  :A container used to store the length of each element
+*/
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<size_t>& v_offset, std::vector<size_t>& v_charlen) {

Review Comment:
   use `get_offsets_and_len` instead of `getOffsetsAndLen`.





[GitHub] [doris] Yukang-Lian commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1041847265


##########
be/src/vec/functions/function_string.h:
##########
@@ -1351,6 +1352,124 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+class FunctionSplitByString : public IFunction {
+public:
+    static constexpr auto name = "split_by_string";
+
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByString>(); }
+    using NullMapType = PaddedPODArray<UInt8>;
+
+    String get_name() const override { return name; }
+
+    bool is_variadic() const override { return false; }
+
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(arguments[0]));
+    }
+
+    Status execute_impl(FunctionContext* /*context*/, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t /*input_rows_count*/) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        ColumnPtr src_column =
+                block.get_by_position(arguments[0]).column->convert_to_full_column_if_const();
+        ColumnPtr delimiter_column =
+                block.get_by_position(arguments[1]).column->convert_to_full_column_if_const();
+
+        DataTypePtr src_column_type = block.get_by_position(arguments[0]).type;
+        auto dest_column_ptr = ColumnArray::create(make_nullable(src_column_type)->create_column(),
+                                                   ColumnArray::ColumnOffsets::create());
+
+        IColumn* dest_nested_column = &dest_column_ptr->get_data();
+        auto& dest_offsets = dest_column_ptr->get_offsets();
+        DCHECK(dest_nested_column != nullptr);
+        dest_nested_column->reserve(0);
+        dest_offsets.reserve(0);
+
+        NullMapType* dest_nested_null_map = nullptr;
+        if (dest_nested_column->is_nullable()) {

Review Comment:
   I will fix it.





[GitHub] [doris] github-actions[bot] commented on pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1340528160

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] github-actions[bot] commented on pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1340616815

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] zhangstar333 commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
zhangstar333 commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1041689486


##########
docs/zh-CN/docs/sql-manual/sql-functions/string-functions/split_by_string.md:
##########
@@ -0,0 +1,94 @@
+---
+{
+    "title": "split_by_string",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## split_by_string 
+
+### description
+
+#### Syntax
+
+```
+split_by_string(s, separator)
+```
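+Splits a string into substrings separated by a string. It uses a constant string of multiple characters as the separator. If the separator string is empty, the string is split into an array of single characters.
+
+#### Arguments
+
+`separator` — The separator is a string that marks where to split. Type: `String`
+
+`s` — The string to be split. Type: `String`
+
+#### Returned value(s)
+
+Returns an array of substrings. Empty substrings are returned in the following cases:
+
+The string to be split begins or ends with the separator;
+
+Multiple separators occur consecutively;
+
+The string to be split is empty while the separator is not empty.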
+
+Type: `Array(String)`
+
+### notice
+
+`Only supported in vectorized engine`
+
+### example
+
+```
+SELECT split_by_string('1, 2 3, 4,5, abcde', ', ');
++---------------------------------------------+
+| split_by_string('1, 2 3, 4,5, abcde', ', ') |
++---------------------------------------------+
+| ['1', '2 3', '4,5', 'abcde']                |
++---------------------------------------------+
+SELECT split_by_string('abcde','');
++--------------------------------+
+| split_by_string('1,2,3,', ',') |
++--------------------------------+
+| ['a', 'b', 'c', 'd', 'e']      |
++--------------------------------+
+SELECT split_by_string(NULL,',');
++----------------------------+
+| split_by_string(NULL, ',') |
++----------------------------+
+| NULL                       |
++----------------------------+
+SELECT split_by_string('1, 2 3, , , 4,5, abcde', ', ');
++-------------------------------------------------+
+| split_by_string('1, 2 3, , , 4,5, abcde', ', ') |
++-------------------------------------------------+
+| ['1', '2 3', '', '', '4,5']                     |
++-------------------------------------------------+

Review Comment:
   Why doesn't the result in this case contain 'abcde'?
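
For comparison, the sketch below shows what a plain left-to-right split produces for the input in the last example. It is a standalone illustration, not the PR's code: the free function `split_by_string` is hypothetical, it works byte-wise (so an empty separator splits multi-byte characters into individual bytes), and it follows the behavior stated in the documentation above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not the PR's implementation: splits `s` on every
// non-overlapping occurrence of `sep`, scanning left to right.
std::vector<std::string> split_by_string(const std::string& s, const std::string& sep) {
    std::vector<std::string> parts;
    if (sep.empty()) {
        // Empty separator: one element per byte, as the documentation describes.
        for (char ch : s) {
            parts.push_back(std::string(1, ch));
        }
        return parts;
    }
    std::string::size_type start = 0;
    std::string::size_type pos = s.find(sep, start);
    while (pos != std::string::npos) {
        parts.push_back(s.substr(start, pos - start));
        start = pos + sep.size();
        pos = s.find(sep, start);
    }
    // Trailing piece after the last separator (may be empty).
    parts.push_back(s.substr(start));
    return parts;
}
```

With this sketch, `split_by_string("1, 2 3, , , 4,5, abcde", ", ")` returns six elements: `1`, `2 3`, two empty strings, `4,5`, and `abcde`.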





[GitHub] [doris] github-actions[bot] commented on pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1340312484

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] Yukang-Lian commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1025333709


##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1160,141 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+class FunctionSplitByString : public IFunction {
+private:
+    void get_offsets_and_len(const std::string& s, const std::string& c, std::vector<size_t>& v_offset,
+                          std::vector<size_t>& v_charlen) {
+        v_offset.clear();
+        v_charlen.clear();
+        if (c.size() == 0) {
+            for (int i = 0; i < s.size(); i++) {
+                v_offset.push_back(i);
+                v_charlen.push_back(1);
+            }
+        } else if (c.size() < s.size()) {
+            string::size_type start = 0, end = s.size() - c.size();
+            size_t end_delimiter_num = 0;
+            while (start < s.size() && start == s.find(c, start)) {
+                v_charlen.push_back(0);
+                v_offset.push_back(start);
+                start += c.size();
+            }
+            if (start > s.size() - 1) {
+                return;
+            }
+            while (start < end && end == s.find(c, end)) {
+                end_delimiter_num++;
+                end -= c.size();
+            }
+            string::size_type pos1 = start, pos2 = s.find(c, start);
+            while (pos2 < end + c.size()) {
+                v_offset.push_back(pos1);
+                v_charlen.push_back(pos2 - pos1);
+                pos1 = pos2 + c.size();
+                pos2 = s.find(c, pos1);
+            }
+            v_offset.push_back(pos1);
+            v_charlen.push_back(s.size() - end_delimiter_num * c.size() - pos1);
+
+            while (end_delimiter_num > 0) {
+                v_charlen.push_back(0);
+                v_offset.push_back(s.size() - end_delimiter_num * c.size());
+                end_delimiter_num--;
+            }
+        } else {
+            v_offset.push_back(0);
+            v_charlen.push_back(s.size());
+        }
+    }
+
+public:
+    static constexpr auto name = "split_by_string";
+
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByString>(); }
+    using NullMapType = PaddedPODArray<UInt8>;
+
+    String get_name() const override { return name; }
+
+    bool is_variadic() const override { return false; }
+
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(arguments[0]));
+    }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        ColumnPtr src_column =
+                block.get_by_position(arguments[0]).column->convert_to_full_column_if_const();
+        ColumnPtr delimiter_column =
+                block.get_by_position(arguments[1]).column->convert_to_full_column_if_const();
+
+        DataTypePtr src_column_type = block.get_by_position(arguments[0]).type;
+        auto dest_column_ptr = ColumnArray::create(make_nullable(src_column_type)->create_column(),
+                                                   ColumnArray::ColumnOffsets::create());
+
+        IColumn* dest_nested_column = &dest_column_ptr->get_data();
+        auto& dest_offsets = dest_column_ptr->get_offsets();
+        DCHECK(dest_nested_column != nullptr);
+        dest_nested_column->reserve(0);
+        dest_offsets.reserve(0);
+
+        NullMapType* dest_nested_null_map = nullptr;
+        if (dest_nested_column->is_nullable()) {
+            ColumnNullable* dest_nullable_col =
+                    reinterpret_cast<ColumnNullable*>(dest_nested_column);
+            dest_nested_column = dest_nullable_col->get_nested_column_ptr();
+            dest_nested_null_map = &dest_nullable_col->get_null_map_column().get_data();
+        }
+
+        _execute(*src_column, *delimiter_column, *dest_nested_column, dest_offsets,
+                 dest_nested_null_map);
+        block.replace_by_position(result, std::move(dest_column_ptr));
+        return Status::OK();
+    }
+
+    void _execute(const IColumn& src_column, const IColumn& delimiter_column,
+                  IColumn& dest_nested_column, ColumnArray::Offsets64& dest_offsets,
+                  NullMapType* dest_nested_null_map) {
+        ColumnString& dest_column_string = reinterpret_cast<ColumnString&>(dest_nested_column);
+        ColumnString::Chars& column_string_chars = dest_column_string.get_chars();
+        ColumnString::Offsets& column_string_offsets = dest_column_string.get_offsets();
+        column_string_chars.reserve(0);
+
+        ColumnArray::Offset64 string_pos = 0;
+        ColumnArray::Offset64 dest_pos = 0;
+        const ColumnString* src_column_string = reinterpret_cast<const ColumnString*>(&src_column);
+        ColumnArray::Offset64 src_offsets_size = src_column_string->get_offsets().size();
+
+        for (size_t i = 0; i < src_offsets_size; i++) {
+            const auto delimiter = delimiter_column.get_data_at(i).to_string();
+            const auto str = src_column_string->get_data_at(i).to_string();
+            StringRef str_ref = src_column_string->get_data_at(i);
+            if (str.size() == 0) {
+                dest_offsets.push_back(dest_pos);
+                continue;
+            }
+            vector<size_t> v_len;
+            vector<size_t> v_offset;
+            getOffsetsAndLen(str, delimiter, v_offset, v_len);

Review Comment:
   Do you mean `get_offsets_and_len`?
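
As a point of reference for this hunk, the offset/length bookkeeping can also be written as a single forward scan. The sketch below is an alternative under stated assumptions, not the code in this PR: it reuses the name `get_offsets_and_len` and the parameter names from the diff, but takes `std::string_view` arguments and makes no claim to match the PR's behavior in every edge case.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Illustrative alternative, not the PR's implementation: records the offset
// and length of each piece of `s` delimited by `c`, without copying any data.
void get_offsets_and_len(std::string_view s, std::string_view c,
                         std::vector<std::size_t>& v_offset,
                         std::vector<std::size_t>& v_charlen) {
    v_offset.clear();
    v_charlen.clear();
    if (c.empty()) {
        // Empty delimiter: every byte becomes its own piece.
        for (std::size_t i = 0; i < s.size(); ++i) {
            v_offset.push_back(i);
            v_charlen.push_back(1);
        }
        return;
    }
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find(c, start);
        const bool found = (pos != std::string_view::npos);
        const std::size_t end = found ? pos : s.size();
        v_offset.push_back(start);
        v_charlen.push_back(end - start);  // 0 for an empty piece
        if (!found) {
            break;
        }
        start = pos + c.size();
    }
}
```

For the input `"a, b, "` with delimiter `", "`, this records offsets `{0, 3, 6}` and lengths `{1, 1, 0}`; the final zero-length piece corresponds to the trailing delimiter.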





[GitHub] [doris] Yukang-Lian commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1041690869


##########
docs/zh-CN/docs/sql-manual/sql-functions/string-functions/split_by_string.md:
##########
@@ -0,0 +1,94 @@
+---
+{
+    "title": "split_by_string",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## split_by_string 
+
+### description
+
+#### Syntax
+
+```
+split_by_string(s, separator)
+```
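+Splits a string into substrings separated by a string. It uses a constant string of multiple characters as the separator. If the separator string is empty, the string is split into an array of single characters.
+
+#### Arguments
+
+`separator` — The separator is a string that marks where to split. Type: `String`
+
+`s` — The string to be split. Type: `String`
+
+#### Returned value(s)
+
+Returns an array of substrings. Empty substrings are returned in the following cases:
+
+The string to be split begins or ends with the separator;
+
+Multiple separators occur consecutively;
+
+The string to be split is empty while the separator is not empty.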
+
+Type: `Array(String)`
+
+### notice
+
+`Only supported in vectorized engine`
+
+### example
+
+```
+SELECT split_by_string('1, 2 3, 4,5, abcde', ', ');
++---------------------------------------------+
+| split_by_string('1, 2 3, 4,5, abcde', ', ') |
++---------------------------------------------+
+| ['1', '2 3', '4,5', 'abcde']                |
++---------------------------------------------+
+SELECT split_by_string('abcde','');
++--------------------------------+
+| split_by_string('1,2,3,', ',') |
++--------------------------------+
+| ['a', 'b', 'c', 'd', 'e']      |
++--------------------------------+
+SELECT split_by_string(NULL,',');
++----------------------------+
+| split_by_string(NULL, ',') |
++----------------------------+
+| NULL                       |
++----------------------------+
+SELECT split_by_string('1, 2 3, , , 4,5, abcde', ', ');
++-------------------------------------------------+
+| split_by_string('1, 2 3, , , 4,5, abcde', ', ') |
++-------------------------------------------------+
+| ['1', '2 3', '', '', '4,5']                     |
++-------------------------------------------------+

Review Comment:
   All the code has been refactored; I will push the latest version later.





[GitHub] [doris] liqing-coder commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
liqing-coder commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1025208368


##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1159,147 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+class FunctionSplitByString : public IFunction {
+
+/**
+ * explain              :Used to split the string to get the offset and length of each element in the string
+ * parameter s          :The string to be split
+ * parameter c          :delimiter(type of string)
+ * parameter v_offset   :A container used to store the offset of each element in a string
+ * parameter v_charlen  :A container used to store the length of each element
+*/
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<size_t>& v_offset, std::vector<size_t>& v_charlen) {

Review Comment:
   done





[GitHub] [doris] Yukang-Lian commented on a diff in pull request #13741: [feature](split_by_char)support split by char function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1006821953


##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;

Review Comment:
   please reformat these two lines



##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);

Review Comment:
   Please remove the redundant commented-out code.
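
For reference, the `getOffsetsAndLen` helper quoted in the hunk above skips runs of consecutive delimiters, so it cannot return the empty substrings that the PR description mentions, and its `pos >= len - 1` check appears to stop before a final one-character piece (for example the `b` in `a,b`). Below is a minimal standalone sketch of the described behavior. The name `split_by_char_sketch` is hypothetical, it uses only the C++ standard library rather than the Doris column types, and it is not the implementation from this PR.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of the PR: splits `s` on a single-character
// delimiter and keeps empty pieces produced by leading, trailing, or
// consecutive delimiters.
std::vector<std::string> split_by_char_sketch(std::string_view s, char delimiter) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find(delimiter, start);
        if (pos == std::string_view::npos) {
            // No further delimiter: the remainder (possibly empty) is the last piece.
            parts.emplace_back(s.substr(start));
            break;
        }
        // Piece between the previous delimiter and this one (may be empty).
        parts.emplace_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

With this sketch, `split_by_char_sketch(",a,,b,", ',')` yields five pieces: an empty string, `a`, an empty string, `b`, and an empty string. Whether an empty input should produce one empty piece or an empty array is a design choice that the thread does not settle; the sketch returns one empty piece.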



##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);
+        auto col_res = ColumnArray::create(ColumnString::create());
+
+        auto& res_data = typeid_cast<ColumnString &>(col_res->get_data());
+        auto& res_offsets = col_res->get_offsets();
+
+        auto& res_data_chars = res_data.get_chars();
+        auto& res_data_offsets = res_data.get_offsets();
+
+        //auto& null_map_data = null_map->get_data();
+

Review Comment:
   Please remove this redundant commented-out code as well.



##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);
+        auto col_res = ColumnArray::create(ColumnString::create());
+
+        auto& res_data = typeid_cast<ColumnString &>(col_res->get_data());
+        auto& res_offsets = col_res->get_offsets();
+
+        auto& res_data_chars = res_data.get_chars();
+        auto& res_data_offsets = res_data.get_offsets();
+
+        //auto& null_map_data = null_map->get_data();
+
+        res_data_offsets.resize(input_rows_count);
+
+        /**
+         * 获得 argument参数(列数据),并存入argument_columns数组中,[0]为str,[1]为delimiter
+        */
+        size_t argument_size = arguments.size();
+        ColumnPtr argument_columns[argument_size];
+        for (size_t i = 0; i < argument_size; ++i) {
+            argument_columns[i] = block.get_by_position(arguments[i]).column->convert_to_full_column_if_const();
+            if (auto* nullable = check_and_get_column<const ColumnNullable>(*argument_columns[i])) {
+                // Danger: Here must dispose the null map data first! Because
+                // argument_columns[i]=nullable->get_nested_column_ptr(); will release the mem
+                // of column nullable mem of null map
+                VectorizedUtils::update_null_map(null_map->get_data(), nullable->get_null_map_data());
+                argument_columns[i] = nullable->get_nested_column_ptr();
+            }
+        }
+        auto str_col = assert_cast<const ColumnString*>(argument_columns[0].get());
+        auto delimiter_col = assert_cast<const ColumnString*>(argument_columns[1].get());
+        
+        /**
+         * 取出列元素中的每一行(delimiter,str),并且进行相关的操作
+        */
+        for (size_t i = 0; i < input_rows_count; ++i) {    
+            auto delimiter = delimiter_col->get_data_at(i);
+            auto delimiter_str = delimiter_col->get_data_at(i).to_string();
+            auto str = str_col->get_data_at(i);
+            auto str_str = str_col->get_data_at(i).to_string();
+            if (delimiter.size == 0) {
+                res_data_offsets[i] = res_data_chars.size();
+            } else if (delimiter.size == 1) {
+                std::vector<int> v_offset;
+                std::vector<int> v_charlen;
+                getOffsetsAndLen(str_col->get_data_at(i).to_string(), delimiter_str, v_offset, v_charlen);
+                for (size_t i = 0; i < v_offset.size(); i++) {
+                    StringOP::push_value_string1(
+                            std::string_view {
+                                    reinterpret_cast<const char*>(str.data + v_offset[i] + 1),
+                                    (size_t)v_charlen[i] - 1},
+                            i, res_data_chars, res_data_offsets);
+                    //res_data_offsets.emplace_back(v_charlen[i]);
+                }
+                res_offsets.emplace_back(v_offset.size()); 
+
+            }
+             
+        }
+        //block.replace_by_position(result, std::move(col_res));

Review Comment:
   Please remove this redundant commented-out code as well.



##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);
+        auto col_res = ColumnArray::create(ColumnString::create());
+
+        auto& res_data = typeid_cast<ColumnString &>(col_res->get_data());
+        auto& res_offsets = col_res->get_offsets();
+
+        auto& res_data_chars = res_data.get_chars();
+        auto& res_data_offsets = res_data.get_offsets();
+
+        //auto& null_map_data = null_map->get_data();
+
+        res_data_offsets.resize(input_rows_count);
+
+        /**
+         * 获得 argument参数(列数据),并存入argument_columns数组中,[0]为str,[1]为delimiter
+        */
+        size_t argument_size = arguments.size();
+        ColumnPtr argument_columns[argument_size];
+        for (size_t i = 0; i < argument_size; ++i) {
+            argument_columns[i] = block.get_by_position(arguments[i]).column->convert_to_full_column_if_const();
+            if (auto* nullable = check_and_get_column<const ColumnNullable>(*argument_columns[i])) {
+                // Danger: Here must dispose the null map data first! Because
+                // argument_columns[i]=nullable->get_nested_column_ptr(); will release the mem
+                // of column nullable mem of null map
+                VectorizedUtils::update_null_map(null_map->get_data(), nullable->get_null_map_data());
+                argument_columns[i] = nullable->get_nested_column_ptr();
+            }
+        }
+        auto str_col = assert_cast<const ColumnString*>(argument_columns[0].get());
+        auto delimiter_col = assert_cast<const ColumnString*>(argument_columns[1].get());
+        
+        /**
+         * 取出列元素中的每一行(delimiter,str),并且进行相关的操作
+        */
+        for (size_t i = 0; i < input_rows_count; ++i) {    
+            auto delimiter = delimiter_col->get_data_at(i);
+            auto delimiter_str = delimiter_col->get_data_at(i).to_string();
+            auto str = str_col->get_data_at(i);
+            auto str_str = str_col->get_data_at(i).to_string();
+            if (delimiter.size == 0) {
+                res_data_offsets[i] = res_data_chars.size();
+            } else if (delimiter.size == 1) {
+                std::vector<int> v_offset;
+                std::vector<int> v_charlen;
+                getOffsetsAndLen(str_col->get_data_at(i).to_string(), delimiter_str, v_offset, v_charlen);
+                for (size_t i = 0; i < v_offset.size(); i++) {
+                    StringOP::push_value_string1(
+                            std::string_view {
+                                    reinterpret_cast<const char*>(str.data + v_offset[i] + 1),
+                                    (size_t)v_charlen[i] - 1},
+                            i, res_data_chars, res_data_offsets);
+                    //res_data_offsets.emplace_back(v_charlen[i]);

Review Comment:
   Please remove this redundant commented-out code as well.



##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);
+        auto col_res = ColumnArray::create(ColumnString::create());
+
+        auto& res_data = typeid_cast<ColumnString &>(col_res->get_data());
+        auto& res_offsets = col_res->get_offsets();
+
+        auto& res_data_chars = res_data.get_chars();
+        auto& res_data_offsets = res_data.get_offsets();
+
+        //auto& null_map_data = null_map->get_data();
+
+        res_data_offsets.resize(input_rows_count);
+
+        /**
+         * 获得 argument参数(列数据),并存入argument_columns数组中,[0]为str,[1]为delimiter
+        */

Review Comment:
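   Please write code comments in English rather than Chinese. (The comment above translates to: "Fetch the argument columns and store them in the argument_columns array; [0] is str, [1] is delimiter.")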



##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);
+        auto col_res = ColumnArray::create(ColumnString::create());
+
+        auto& res_data = typeid_cast<ColumnString &>(col_res->get_data());
+        auto& res_offsets = col_res->get_offsets();
+
+        auto& res_data_chars = res_data.get_chars();
+        auto& res_data_offsets = res_data.get_offsets();
+
+        //auto& null_map_data = null_map->get_data();
+
+        res_data_offsets.resize(input_rows_count);
+
+        /**
+         * 获得 argument参数(列数据),并存入argument_columns数组中,[0]为str,[1]为delimiter
+        */
+        size_t argument_size = arguments.size();
+        ColumnPtr argument_columns[argument_size];
+        for (size_t i = 0; i < argument_size; ++i) {
+            argument_columns[i] = block.get_by_position(arguments[i]).column->convert_to_full_column_if_const();
+            if (auto* nullable = check_and_get_column<const ColumnNullable>(*argument_columns[i])) {
+                // Danger: Here must dispose the null map data first! Because
+                // argument_columns[i]=nullable->get_nested_column_ptr(); will release the mem
+                // of column nullable mem of null map
+                VectorizedUtils::update_null_map(null_map->get_data(), nullable->get_null_map_data());
+                argument_columns[i] = nullable->get_nested_column_ptr();
+            }
+        }
+        auto str_col = assert_cast<const ColumnString*>(argument_columns[0].get());
+        auto delimiter_col = assert_cast<const ColumnString*>(argument_columns[1].get());
+        
+        /**
+         * 取出列元素中的每一行(delimiter,str),并且进行相关的操作
+        */

Review Comment:
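   Please write this comment in English as well. (The comment above translates to: "Take each row (delimiter, str) from the columns and perform the relevant operations.")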



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] Yukang-Lian commented on a diff in pull request #13741: [feature](split_by_char)support split by char function

Posted by GitBox <gi...@apache.org>.
Yukang-Lian commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1006819548


##########
be/src/vec/functions/function_string.h:
##########
@@ -1159,6 +1166,129 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+
+class FunctionSplitByChar : public IFunction {
+
+private:
+    void getOffsetsAndLen(const std::string& s, const std::string& c, std::vector<int>& v_offset, std::vector<int>& v_charlen) {
+        /**
+         * 
+         * s : string need to be split
+         * c : delimiter_string
+         * v_offset  : each word splited offset in string
+         * v_charlen : each word length in string
+        */
+        char delimiter_char = c[0];
+        int32_t pos = 0;
+	    int32_t pos_start = 0;
+	    int32_t pos_end = 0;
+        int32_t len = s.size();
+        bool flag = true;
+
+	    while (flag) {
+		    while (pos < len && s[pos] == delimiter_char) {
+			    pos++;
+                if (pos >= len - 1) {
+                    flag = false;
+                }
+            }
+
+            if (!flag || pos >= len) {
+                break;
+            }
+            pos_start = pos;
+            v_offset.emplace_back(pos_start);
+            while (pos < len && s[pos] != delimiter_char) {
+                pos++;
+            }
+            pos_end = pos;
+            v_charlen.emplace_back(pos_end - pos_start);
+        }
+    }
+public:
+    static constexpr auto name = "split_by_char";
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByChar>(); }
+    String get_name() const override { return name; }
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()));
+    }
+
+    bool use_default_implementation_for_nulls() const override { return false; }
+    bool use_default_implementation_for_constants() const override { return true; }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        auto null_map = ColumnUInt8::create(input_rows_count, 0);
+        //auto const_null_map = ColumnUInt8::create(input_rows_count, 0);
+        auto col_res = ColumnArray::create(ColumnString::create());
+
+        auto& res_data = typeid_cast<ColumnString &>(col_res->get_data());
+        auto& res_offsets = col_res->get_offsets();
+
+        auto& res_data_chars = res_data.get_chars();
+        auto& res_data_offsets = res_data.get_offsets();
+
+        //auto& null_map_data = null_map->get_data();
+
+        res_data_offsets.resize(input_rows_count);
+
+        /**
+         * 获得 argument参数(列数据),并存入argument_columns数组中,[0]为str,[1]为delimiter
+        */
+        size_t argument_size = arguments.size();
+        ColumnPtr argument_columns[argument_size];
+        for (size_t i = 0; i < argument_size; ++i) {
+            argument_columns[i] = block.get_by_position(arguments[i]).column->convert_to_full_column_if_const();
+            if (auto* nullable = check_and_get_column<const ColumnNullable>(*argument_columns[i])) {
+                // Danger: Here must dispose the null map data first! Because
+                // argument_columns[i]=nullable->get_nested_column_ptr(); will release the mem
+                // of column nullable mem of null map
+                VectorizedUtils::update_null_map(null_map->get_data(), nullable->get_null_map_data());
+                argument_columns[i] = nullable->get_nested_column_ptr();
+            }
+        }
+        auto str_col = assert_cast<const ColumnString*>(argument_columns[0].get());
+        auto delimiter_col = assert_cast<const ColumnString*>(argument_columns[1].get());
+        
+        /**
+         * 取出列元素中的每一行(delimiter,str),并且进行相关的操作
+        */
+        for (size_t i = 0; i < input_rows_count; ++i) {    
+            auto delimiter = delimiter_col->get_data_at(i);
+            auto delimiter_str = delimiter_col->get_data_at(i).to_string();
+            auto str = str_col->get_data_at(i);
+            auto str_str = str_col->get_data_at(i).to_string();
+            if (delimiter.size == 0) {
+                res_data_offsets[i] = res_data_chars.size();
+            } else if (delimiter.size == 1) {
+                std::vector<int> v_offset;
+                std::vector<int> v_charlen;
+                getOffsetsAndLen(str_col->get_data_at(i).to_string(), delimiter_str, v_offset, v_charlen);
+                for (size_t i = 0; i < v_offset.size(); i++) {
+                    StringOP::push_value_string1(
+                            std::string_view {
+                                    reinterpret_cast<const char*>(str.data + v_offset[i] + 1),
+                                    (size_t)v_charlen[i] - 1},
+                            i, res_data_chars, res_data_offsets);
+                    //res_data_offsets.emplace_back(v_charlen[i]);
+                }
+                res_offsets.emplace_back(v_offset.size()); 
+
+            }
+             
+        }
+        //block.replace_by_position(result, std::move(col_res));

Review Comment:
   Please remove the redundant commented-out code here, too.





[GitHub] [doris] github-actions[bot] commented on pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13741:
URL: https://github.com/apache/doris/pull/13741#issuecomment-1341932736

   PR approved by at least one committer and no changes requested.




[GitHub] [doris] github-actions[bot] commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1041693095


##########
be/src/vec/functions/function_string.h:
##########
@@ -1351,6 +1352,124 @@
     }
 };
 
+class FunctionSplitByString : public IFunction {
+public:
+    static constexpr auto name = "split_by_string";
+
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByString>(); }
+    using NullMapType = PaddedPODArray<UInt8>;
+
+    String get_name() const override { return name; }
+
+    bool is_variadic() const override { return false; }
+
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(arguments[0]));
+    }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t input_rows_count) override {

Review Comment:
   warning: parameter 'input_rows_count' is unused [misc-unused-parameters]
   
   ```suggestion
                           size_t result, size_t  /*input_rows_count*/) override {
   ```
   



##########
be/src/vec/functions/function_string.h:
##########
@@ -1351,6 +1352,124 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+class FunctionSplitByString : public IFunction {
+public:
+    static constexpr auto name = "split_by_string";
+
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByString>(); }
+    using NullMapType = PaddedPODArray<UInt8>;
+
+    String get_name() const override { return name; }
+
+    bool is_variadic() const override { return false; }
+
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(arguments[0]));
+    }
+
+    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,

Review Comment:
   warning: parameter 'context' is unused [misc-unused-parameters]
   
   ```suggestion
       Status execute_impl(FunctionContext*  /*context*/, Block& block, const ColumnNumbers& arguments,
   ```
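
Both clang-tidy suggestions above use the same pattern: when an overridden method must keep the base-class signature but does not use a parameter, the parameter name is commented out, which silences `misc-unused-parameters` while still documenting what the argument is. A minimal sketch follows; `RowProcessor` and `DoublingProcessor` are hypothetical types made up for illustration and are not Doris classes.

```cpp
#include <cstddef>

// Hypothetical interface standing in for a virtual method whose signature
// is fixed by a base class.
struct RowProcessor {
    virtual ~RowProcessor() = default;
    virtual int process(void* context, int value, std::size_t row_count) = 0;
};

struct DoublingProcessor : RowProcessor {
    // `context` and `row_count` are required by the base signature but are
    // not used here, so their names are commented out.
    int process(void* /*context*/, int value, std::size_t /*row_count*/) override {
        return value * 2;
    }
};
```

Calling `DoublingProcessor{}.process(nullptr, 21, 0)` returns 42. The commented-out names do not change the method's type, so the override still matches the base declaration.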
   





[GitHub] [doris] zhangstar333 commented on a diff in pull request #13741: [feature](split_by_string)support split by string function

Posted by GitBox <gi...@apache.org>.
zhangstar333 commented on code in PR #13741:
URL: https://github.com/apache/doris/pull/13741#discussion_r1041845348


##########
be/src/vec/functions/function_string.h:
##########
@@ -1351,6 +1352,124 @@ class FunctionSplitPart : public IFunction {
     }
 };
 
+class FunctionSplitByString : public IFunction {
+public:
+    static constexpr auto name = "split_by_string";
+
+    static FunctionPtr create() { return std::make_shared<FunctionSplitByString>(); }
+    using NullMapType = PaddedPODArray<UInt8>;
+
+    String get_name() const override { return name; }
+
+    bool is_variadic() const override { return false; }
+
+    size_t get_number_of_arguments() const override { return 2; }
+
+    DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
+        return std::make_shared<DataTypeArray>(make_nullable(arguments[0]));
+    }
+
+    Status execute_impl(FunctionContext* /*context*/, Block& block, const ColumnNumbers& arguments,
+                        size_t result, size_t /*input_rows_count*/) override {
+        DCHECK_EQ(arguments.size(), 2);
+
+        ColumnPtr src_column =
+                block.get_by_position(arguments[0]).column->convert_to_full_column_if_const();
+        ColumnPtr delimiter_column =
+                block.get_by_position(arguments[1]).column->convert_to_full_column_if_const();
+
+        DataTypePtr src_column_type = block.get_by_position(arguments[0]).type;
+        auto dest_column_ptr = ColumnArray::create(make_nullable(src_column_type)->create_column(),
+                                                   ColumnArray::ColumnOffsets::create());
+
+        IColumn* dest_nested_column = &dest_column_ptr->get_data();
+        auto& dest_offsets = dest_column_ptr->get_offsets();
+        DCHECK(dest_nested_column != nullptr);
+        dest_nested_column->reserve(0);
+        dest_offsets.reserve(0);
+
+        NullMapType* dest_nested_null_map = nullptr;
+        if (dest_nested_column->is_nullable()) {

Review Comment:
   This condition seems to always be true.
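
The observation presumably follows from how the nested column is built in the hunk above: it is created from `make_nullable(src_column_type)`, so a later `is_nullable()` check on it cannot fail. The sketch below illustrates that reasoning with hypothetical stand-in types (`Column`, `NullableColumn`, `make_nullable_sketch`); these are not the Doris classes, and their behavior is an assumption made for illustration only.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for the column hierarchy; not the Doris classes.
struct Column {
    virtual ~Column() = default;
    virtual bool is_nullable() const { return false; }
};

struct NullableColumn : Column {
    explicit NullableColumn(std::unique_ptr<Column> nested) : nested_(std::move(nested)) {}
    bool is_nullable() const override { return true; }
    std::unique_ptr<Column> nested_;
};

// Wraps a column in a nullable column unless it is already nullable, so the
// result always reports is_nullable() == true.
std::unique_ptr<Column> make_nullable_sketch(std::unique_ptr<Column> col) {
    if (col->is_nullable()) {
        return col;
    }
    return std::make_unique<NullableColumn>(std::move(col));
}
```

Under these assumptions, any column returned by `make_nullable_sketch` reports `is_nullable()` as true, so a branch guarded by that check could be replaced by unconditional code or a `DCHECK`.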


