You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/05/19 09:30:31 UTC

[GitHub] [orc] wgtmac commented on a change in pull request #701: ORC-800: Add `selectMaps` API

wgtmac commented on a change in pull request #701:
URL: https://github.com/apache/orc/pull/701#discussion_r635067609



##########
File path: c++/src/Reader.cc
##########
@@ -99,6 +99,20 @@ namespace orc {
     return result;
   }
 
+  /**
+   * Recurses over a type tree and selects the map.key and map.value of every selected map type.
+   */
+  void ColumnSelector::selectMaps(std::vector<bool>& selectedColumns, const Type& type) {
+    size_t id = static_cast<size_t>(type.getColumnId());
+    for(uint64_t c=0; c < type.getSubtypeCount(); ++c) {
+      if (type.getKind() == MAP && selectedColumns[id]) {
+        selectChildren(selectedColumns, *type.getSubtype(c));

Review comment:
       This implementation means all sub types of the MAP KEY or MAP VALUE are selected recursively. For example, if the map value type is a struct type which contains 100 sub types, we need to read all of them even the SQL query does not request so.

##########
File path: c++/src/Reader.hh
##########
@@ -96,6 +96,9 @@ namespace orc {
     // For each child of type, select it if one of its children
     // is selected.
     bool selectParents(std::vector<bool>& selectedColumns, const Type& type);
+
+    // For selected map type, select its children, key and value
+    void selectMaps(std::vector<bool>& selectedColumns, const Type& type);

Review comment:
       Does LIST type has same issue?

##########
File path: tools/test/TestMatch.cc
##########
@@ -1085,6 +1085,27 @@ TEST(TestMatch, selectColumns) {
         << "\"887336a7\"}}]}";
     EXPECT_EQ(expectedMapWithColumnId.str(), line);
 
+    // Map column #12 again, to test map key is automatically included
+    // two subtypes with column id:
+    // map<string(20),struct(21)<int1(22):int,string1(23):string>
+    cols.clear();
+    cols.push_back(22);
+    cols.push_back(23);
+    rowReaderOpts.includeTypes(cols);
+    rowReader = reader->createRowReader(rowReaderOpts);
+    c = rowReader->getSelectedColumns();
+    for (unsigned int i=1; i < c.size(); i++) {
+      if (i>=19 && i<=23)
+        EXPECT_TRUE(c[i]);
+      else
+        EXPECT_TRUE(!c[i]);
+    }
+    batch = rowReader->createRowBatch(1);
+    std::ostringstream expectedMapSchema;
+    expectedMapSchema << "Struct vector <0 of 1; Map vector <Byte vector <0 of 1>, "
+        << "Struct vector <0 of 1; Long vector <0 of 1>; Byte vector <0 of 1>; > with 0 of 1>; >";
+    EXPECT_EQ(expectedMapSchema.str(), batch->toString());

Review comment:
       I think it is valid to read only map key or map value column, or even a subset of key or value column if itself is a complex type. And you reported the bug from toString() function. Should we only fix this function to handle the case where its sub vector is null?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org