You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/04/01 08:33:43 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

wgtmac opened a new pull request, #34836:
URL: https://github.com/apache/arrow/pull/34836

   ### Rationale for this change
   
   Currently the ORC CHAR type is mapped to Arrow FIXED_SIZE_BINARY type. However, the meaning of length differs between these types. The length of ORC CHAR type means number of UTF8 characters while that of Arrow FIXED_SIZE_BINARY means number of bytes.
   
   ### What changes are included in this PR?
   
   Use Arrow String type to read from ORC CHAR type.
   
   ### Are these changes tested?
   
   Make sure all tests pass.
   
   ### Are there any user-facing changes?
   
   Yes, the doc has been updated as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34836:
URL: https://github.com/apache/arrow/pull/34836#issuecomment-1492878706

   * Closes: #34823


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wgtmac commented on pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #34836:
URL: https://github.com/apache/arrow/pull/34836#issuecomment-1493508425

   Test failure seems related to https://github.com/apache/arrow/issues/34768


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34836:
URL: https://github.com/apache/arrow/pull/34836#issuecomment-1492878714

   :warning: GitHub issue #34823 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #34836:
URL: https://github.com/apache/arrow/pull/34836#issuecomment-1493928545

   Benchmark runs are scheduled for baseline = 67fbca9e565f7ba8deb5767238c2c703cd7406c1 and contender = ce94fbabae53fb53fef600f224dd44fc69b713b4. ce94fbabae53fb53fef600f224dd44fc69b713b4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/cf7edfcf21184c728a634b0ac012a5cb...813b8b6c4f944e329645e64ee08d30f0/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/26e663b23ac44060961b06f4819c3d94...44b684c84fb140009ef8d29a4d66f5f8/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7373d247568b471aae4718e16fa32b2f...b778c8500e5446eb844e925519e1602e/)
   [Failed :arrow_down:2.86% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0189f674c5754261a13b64f8c6934506...fa762e9aa44e4de0965e396ca7259ee3/)
   Buildkite builds:
   [Finished] [`ce94fbab` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2614)
   [Failed] [`ce94fbab` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2645)
   [Finished] [`ce94fbab` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2612)
   [Failed] [`ce94fbab` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2636)
   [Finished] [`67fbca9e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2613)
   [Failed] [`67fbca9e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2644)
   [Finished] [`67fbca9e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2611)
   [Failed] [`67fbca9e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2635)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] kou commented on a diff in pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on code in PR #34836:
URL: https://github.com/apache/arrow/pull/34836#discussion_r1155420214


##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); col++) {

Review Comment:
   ```suggestion
     for (uint64_t col = 0; col < orc_type->getSubtypeCount(); ++col) {
   ```



##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); col++) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t i = 0; i < row_count; ++i) {

Review Comment:
   Should we use `row` instead of `i` here like the below loop?



##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); col++) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t i = 0; i < row_count; ++i) {
+      str_batch->data[i] = const_cast<char*>(data[i].c_str());
+      str_batch->length[i] = static_cast<int64_t>(data[i].size());
+    }
+  }
+  batch->numElements = row_count;
+  writer->add(*batch);
+  writer->close();
+
+  std::shared_ptr<io::RandomAccessFile> in_stream(std::make_shared<io::BufferReader>(
+      reinterpret_cast<const uint8_t*>(mem_stream.getData()),
+      static_cast<int64_t>(mem_stream.getLength())));
+  ASSERT_OK_AND_ASSIGN(
+      auto reader, adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool()));
+  ASSERT_EQ(row_count, reader->NumberOfRows());
+  ASSERT_EQ(1, reader->NumberOfStripes());
+
+  EXPECT_OK_AND_ASSIGN(auto stripe_reader, reader->NextStripeReader(row_count));
+  std::shared_ptr<RecordBatch> record_batch;
+  ASSERT_TRUE(stripe_reader->ReadNext(&record_batch).ok());
+  ASSERT_NE(record_batch, nullptr);

Review Comment:
   It seems that we use `expected, actual` order in this test.
   
   ```suggestion
     ASSERT_NE(nullptr, record_batch);
   ```



##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); col++) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t i = 0; i < row_count; ++i) {
+      str_batch->data[i] = const_cast<char*>(data[i].c_str());
+      str_batch->length[i] = static_cast<int64_t>(data[i].size());
+    }
+  }
+  batch->numElements = row_count;
+  writer->add(*batch);
+  writer->close();
+
+  std::shared_ptr<io::RandomAccessFile> in_stream(std::make_shared<io::BufferReader>(
+      reinterpret_cast<const uint8_t*>(mem_stream.getData()),
+      static_cast<int64_t>(mem_stream.getLength())));
+  ASSERT_OK_AND_ASSIGN(
+      auto reader, adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool()));
+  ASSERT_EQ(row_count, reader->NumberOfRows());
+  ASSERT_EQ(1, reader->NumberOfStripes());
+
+  EXPECT_OK_AND_ASSIGN(auto stripe_reader, reader->NextStripeReader(row_count));
+  std::shared_ptr<RecordBatch> record_batch;
+  ASSERT_TRUE(stripe_reader->ReadNext(&record_batch).ok());

Review Comment:
   Can we use `ASSERT_OK()` here?
   
   ```suggestion
     ASSERT_OK(stripe_reader->ReadNext(&record_batch));
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] kou commented on a diff in pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on code in PR #34836:
URL: https://github.com/apache/arrow/pull/34836#discussion_r1155441560


##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); ++col) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t row = 0; row < row_count; ++row) {
+      str_batch->data[row] = const_cast<char*>(data[row].c_str());
+      str_batch->length[row] = static_cast<int64_t>(data[row].size());
+    }
+  }
+  batch->numElements = row_count;
+  writer->add(*batch);
+  writer->close();
+
+  std::shared_ptr<io::RandomAccessFile> in_stream(std::make_shared<io::BufferReader>(
+      reinterpret_cast<const uint8_t*>(mem_stream.getData()),
+      static_cast<int64_t>(mem_stream.getLength())));
+  ASSERT_OK_AND_ASSIGN(
+      auto reader, adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool()));
+  ASSERT_EQ(row_count, reader->NumberOfRows());
+  ASSERT_EQ(1, reader->NumberOfStripes());
+
+  EXPECT_OK_AND_ASSIGN(auto stripe_reader, reader->NextStripeReader(row_count));
+  std::shared_ptr<RecordBatch> record_batch;
+  ASSERT_OK(stripe_reader->ReadNext(&record_batch));
+  ASSERT_NE(nullptr, record_batch);
+  ASSERT_EQ(row_count, record_batch->num_rows());
+
+  for (int col = 0; col < record_batch->num_columns(); col++) {
+    auto str_array = checked_pointer_cast<StringArray>(record_batch->column(col));
+    for (int row = 0; row < row_count; row++) {

Review Comment:
   ```suggestion
       for (int row = 0; row < row_count; ++row) {
   ```



##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); ++col) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t row = 0; row < row_count; ++row) {
+      str_batch->data[row] = const_cast<char*>(data[row].c_str());
+      str_batch->length[row] = static_cast<int64_t>(data[row].size());
+    }
+  }
+  batch->numElements = row_count;
+  writer->add(*batch);
+  writer->close();
+
+  std::shared_ptr<io::RandomAccessFile> in_stream(std::make_shared<io::BufferReader>(
+      reinterpret_cast<const uint8_t*>(mem_stream.getData()),
+      static_cast<int64_t>(mem_stream.getLength())));
+  ASSERT_OK_AND_ASSIGN(
+      auto reader, adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool()));
+  ASSERT_EQ(row_count, reader->NumberOfRows());
+  ASSERT_EQ(1, reader->NumberOfStripes());
+
+  EXPECT_OK_AND_ASSIGN(auto stripe_reader, reader->NextStripeReader(row_count));
+  std::shared_ptr<RecordBatch> record_batch;
+  ASSERT_OK(stripe_reader->ReadNext(&record_batch));
+  ASSERT_NE(nullptr, record_batch);
+  ASSERT_EQ(row_count, record_batch->num_rows());
+
+  for (int col = 0; col < record_batch->num_columns(); col++) {
+    auto str_array = checked_pointer_cast<StringArray>(record_batch->column(col));
+    for (int row = 0; row < row_count; row++) {
+      EXPECT_EQ(expected_data[col][row], str_array->GetString(row));
+    }
+  }
+  ASSERT_TRUE(stripe_reader->ReadNext(&record_batch).ok());

Review Comment:
   ```suggestion
     ASSERT_OK(stripe_reader->ReadNext(&record_batch));
   ```



##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); ++col) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t row = 0; row < row_count; ++row) {
+      str_batch->data[row] = const_cast<char*>(data[row].c_str());
+      str_batch->length[row] = static_cast<int64_t>(data[row].size());
+    }
+  }
+  batch->numElements = row_count;
+  writer->add(*batch);
+  writer->close();
+
+  std::shared_ptr<io::RandomAccessFile> in_stream(std::make_shared<io::BufferReader>(
+      reinterpret_cast<const uint8_t*>(mem_stream.getData()),
+      static_cast<int64_t>(mem_stream.getLength())));
+  ASSERT_OK_AND_ASSIGN(
+      auto reader, adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool()));
+  ASSERT_EQ(row_count, reader->NumberOfRows());
+  ASSERT_EQ(1, reader->NumberOfStripes());
+
+  EXPECT_OK_AND_ASSIGN(auto stripe_reader, reader->NextStripeReader(row_count));
+  std::shared_ptr<RecordBatch> record_batch;
+  ASSERT_OK(stripe_reader->ReadNext(&record_batch));
+  ASSERT_NE(nullptr, record_batch);
+  ASSERT_EQ(row_count, record_batch->num_rows());
+
+  for (int col = 0; col < record_batch->num_columns(); col++) {

Review Comment:
   ```suggestion
     for (int col = 0; col < record_batch->num_columns(); ++col) {
   ```



##########
cpp/src/arrow/adapters/orc/adapter_test.cc:
##########
@@ -453,6 +453,59 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   EXPECT_EQ(num_rows / reader_batch_size, batches);
 }
 
+TEST(TestAdapterRead, ReadCharAndVarcharType) {
+  MemoryOutputStream mem_stream(kDefaultMemStreamSize);
+  auto orc_type = liborc::Type::buildTypeFromString("struct<c1:char(6),c2:varchar(6)>");
+  auto writer = CreateWriter(/*stripe_size=*/1024, *orc_type, &mem_stream);
+
+  constexpr int64_t row_count = 2;
+  auto batch = writer->createRowBatch(row_count);
+  auto struct_batch = internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
+
+  // Verify that longer data will be truncated by ORC char and varchar types.
+  // In addition, ORC char type will pad the data with spaces if the data is shorter.
+  const std::vector<std::string> data = {"abcd", "ABCDEFGH"};
+  const std::vector<std::vector<std::string>> expected_data = {{"abcd  ", "ABCDEF"},
+                                                               {"abcd", "ABCDEF"}};
+
+  for (uint64_t col = 0; col < orc_type->getSubtypeCount(); ++col) {
+    auto str_batch =
+        internal::checked_cast<liborc::StringVectorBatch*>(struct_batch->fields[col]);
+    str_batch->hasNulls = false;
+    str_batch->numElements = row_count;
+    for (int64_t row = 0; row < row_count; ++row) {
+      str_batch->data[row] = const_cast<char*>(data[row].c_str());
+      str_batch->length[row] = static_cast<int64_t>(data[row].size());
+    }
+  }
+  batch->numElements = row_count;
+  writer->add(*batch);
+  writer->close();
+
+  std::shared_ptr<io::RandomAccessFile> in_stream(std::make_shared<io::BufferReader>(
+      reinterpret_cast<const uint8_t*>(mem_stream.getData()),
+      static_cast<int64_t>(mem_stream.getLength())));
+  ASSERT_OK_AND_ASSIGN(
+      auto reader, adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool()));
+  ASSERT_EQ(row_count, reader->NumberOfRows());
+  ASSERT_EQ(1, reader->NumberOfStripes());
+
+  EXPECT_OK_AND_ASSIGN(auto stripe_reader, reader->NextStripeReader(row_count));
+  std::shared_ptr<RecordBatch> record_batch;
+  ASSERT_OK(stripe_reader->ReadNext(&record_batch));
+  ASSERT_NE(nullptr, record_batch);
+  ASSERT_EQ(row_count, record_batch->num_rows());
+
+  for (int col = 0; col < record_batch->num_columns(); col++) {
+    auto str_array = checked_pointer_cast<StringArray>(record_batch->column(col));
+    for (int row = 0; row < row_count; row++) {
+      EXPECT_EQ(expected_data[col][row], str_array->GetString(row));
+    }
+  }
+  ASSERT_TRUE(stripe_reader->ReadNext(&record_batch).ok());
+  ASSERT_EQ(record_batch, nullptr);

Review Comment:
   ```suggestion
     ASSERT_EQ(nullptr, record_batch);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wgtmac commented on pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #34836:
URL: https://github.com/apache/arrow/pull/34836#issuecomment-1493030803

   No, I don't think ORC VARCHAR type is covered by any test either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] kou merged pull request #34836: GH-34823: [C++][ORC] Fix ORC CHAR type mapping

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou merged PR #34836:
URL: https://github.com/apache/arrow/pull/34836


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org