You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/05/18 15:22:43 UTC

[GitHub] [orc] czxrrr opened a new pull request #701: [ORC-800] add selectMaps()

czxrrr opened a new pull request #701:
URL: https://github.com/apache/orc/pull/701


   MapVectorBatch::toString will encounter segment fault if map.value is selected while map.key is not selected. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635317164



##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       The bug does not only exist in ReaderImpl::toString() and it exists in ReaderImpl::getMemoryUse() ALSO.
   Once key or value missing, even rootBatch->getMemoryUse() will fail. 
   In my opinion, If people really wanna querying keys or values independently, I think struct of array of keys and array of values is more suitable, they do not need map. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr removed a comment on pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr removed a comment on pull request #701:
URL: https://github.com/apache/orc/pull/701#issuecomment-843388179


   I Added a testcase. Without this commit, toString() function call will segment fault.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] wgtmac commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
wgtmac commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635067609



##########
File path: c++/src/Reader.cc
##########
@@ -99,6 +99,20 @@ namespace orc {
     return result;
   }
 
+  /**
+   * Recurses over a type tree and selects the map.key and map.value of every selected map type.
+   */
+  void ColumnSelector::selectMaps(std::vector<bool>& selectedColumns, const Type& type) {
+    size_t id = static_cast<size_t>(type.getColumnId());
+    for(uint64_t c=0; c < type.getSubtypeCount(); ++c) {
+      if (type.getKind() == MAP && selectedColumns[id]) {
+        selectChildren(selectedColumns, *type.getSubtype(c));

Review comment:
       This implementation means all sub types of the MAP KEY or MAP VALUE are selected recursively. For example, if the map value type is a struct type which contains 100 sub types, we need to read all of them even the SQL query does not request so.

##########
File path: c++/src/Reader.hh
##########
@@ -96,6 +96,9 @@ namespace orc {
     // For each child of type, select it if one of its children
     // is selected.
     bool selectParents(std::vector<bool>& selectedColumns, const Type& type);
+
+    // For selected map type, select its children, key and value
+    void selectMaps(std::vector<bool>& selectedColumns, const Type& type);

Review comment:
       Does LIST type has same issue?

##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       I think it is valid to read only map key or map value column, or even a subset of key or value column if itself is a complex type. And you reported the bug from toString() function. Should we only fix this function to handle the case where its sub vector is null?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on pull request #701:
URL: https://github.com/apache/orc/pull/701#issuecomment-843402656


   Testcase was just attached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635317164



##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       The bug does not only exist in MapVectorBatch::toString() and it exists in MapVectorBatch::getMemoryUse() ALSO.
   Once key or value missing, even rootBatch->getMemoryUsage() will fail. 
   In my opinion, If people really wanna querying keys or values independently, Struct of array of keys and array of values is more suitable, they do not need map. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #701: [ORC-800] add selectMaps()

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #701:
URL: https://github.com/apache/orc/pull/701#issuecomment-843319154


   +1 for @wgtmac 's comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635317164



##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       The bug does not only exist in ReaderImpl::toString() and it exists in ReaderImpl::getMemoryUse() ALSO.
   Once key or value missing, even rootBatch->getMemoryUse() will fail. 
   In my opinion, If people really wanna querying keys or values independently, Struct of array of keys and array of values is more suitable, they do not need map. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635330192



##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       But letting people query part of map would be more general way to solve it. Let me try fix it in this way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635317164



##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       The bug does not only exist in MapVectorBatch::toString() and it exists in MapVectorBatch::getMemoryUse() ALSO.
   Once key or value missing, even rootBatch->getMemoryUse() will fail. 
   In my opinion, If people really wanna querying keys or values independently, Struct of array of keys and array of values is more suitable, they do not need map. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on a change in pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635306365



##########
File path: c++/src/Reader.hh
##########
@@ -96,6 +96,9 @@ namespace orc {
     // For each child of type, select it if one of its children
     // is selected.
     bool selectParents(std::vector<bool>& selectedColumns, const Type& type);
+
+    // For selected map type, select its children, key and value
+    void selectMaps(std::vector<bool>& selectedColumns, const Type& type);

Review comment:
       List won't have such issues, because list only has one child. Parent and Child have to be both selected or both unselected.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] wgtmac merged pull request #701: ORC-800: Add `key` and `value` nullness check in `MapVectorBatch`

Posted by GitBox <gi...@apache.org>.
wgtmac merged pull request #701:
URL: https://github.com/apache/orc/pull/701


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] czxrrr commented on pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
czxrrr commented on pull request #701:
URL: https://github.com/apache/orc/pull/701#issuecomment-843388179


   I Added a testcase. Without this commit, toString() function call will segment fault.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #701: ORC-800: Add `selectMaps` API

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #701:
URL: https://github.com/apache/orc/pull/701#issuecomment-843405954


   Thank you, @czxrrr 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org