You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 03:35:03 UTC

[GitHub] [arrow] mathyingzhou opened a new pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

mathyingzhou opened a new pull request #10157:
URL: https://github.com/apache/arrow/pull/10157


   Not sure whether this is what Caleb Winston and Jeremy Dyer have been looking for but at least the (user) metadata reader has been exposed in C++ and Python.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r631554343



##########
File path: python/pyarrow/_orc.pyx
##########
@@ -22,16 +22,19 @@
 from cython.operator cimport dereference as deref
 from libcpp.vector cimport vector as std_vector
 from libcpp.utility cimport move
+from libcpp.memory cimport const_pointer_cast

Review comment:
       Ah sorry I forgot to remove it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] calebwin commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
calebwin commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-826821786


   Hi @mathyingzhou,
   
   Thanks for working on this and this looks great. I think the main things we wanted were schema, stripe-level statistics, and even row-level statistics (lower priority than stripe-level).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-840392456


   @pitrou Thanks for the comments! Please review again. None of the failures has anything to do with my PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-827283131


   @calebwin Really thanks! For schema there is already ORCFile.schema(). For the statistics we will need to add the adapter to some other ORC functions which will take a while. I suddenly got very busy at work so I will need 4 extra weeks before this can be delivered. Sorry for the inconvenience..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou edited a comment on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou edited a comment on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-834501015


   @pitrou Done! Please review again.
   P.S. The R error does not seem to be related to my code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r629332737



##########
File path: python/pyarrow/includes/common.pxd
##########
@@ -34,6 +34,7 @@ cimport cpython
 
 cdef extern from * namespace "std" nogil:
     cdef shared_ptr[T] static_pointer_cast[T, U](shared_ptr[U])
+    cdef shared_ptr[T] const_pointer_cast[T, U](shared_ptr[U])

Review comment:
       This doesn't seem used?

##########
File path: cpp/src/arrow/adapters/orc/adapter.cc
##########
@@ -229,6 +229,17 @@ class ORCFileReader::Impl {
     return GetArrowSchema(type, out);
   }
 
+  Result<std::shared_ptr<KeyValueMetadata>> ReadMetadata() {
+    std::list<std::string> keys = reader_->getMetadataKeys();

Review comment:
       I meant the list of keys :-)

##########
File path: python/pyarrow/_orc.pyx
##########
@@ -86,14 +107,14 @@ cdef class ORCReader(_Weakrefable):
             int64_t stripe
             std_vector[int] indices
 
-        stripe = n
+        stripe=n

Review comment:
       Why are you making all these changes?

##########
File path: python/pyarrow/orc.py
##########
@@ -26,11 +26,13 @@
 
 
 def _is_map(typ):
-    return (types.is_list(typ) and
-            types.is_struct(typ.value_type) and
-            typ.value_type.num_fields == 2 and
-            typ.value_type[0].name == 'key' and
-            typ.value_type[1].name == 'value')
+    return (
+        types.is_list(typ) and
+        types.is_struct(typ.value_type) and
+        typ.value_type.num_fields == 2 and
+        typ.value_type[0].name == "key" and
+        typ.value_type[1].name == "value"
+    )

Review comment:
       Hmm... can you avoid making unrelated style changes? If your editor is configured to auto-reformat Python code, you may want to disable that.

##########
File path: python/pyarrow/_orc.pyx
##########
@@ -22,16 +22,19 @@
 from cython.operator cimport dereference as deref
 from libcpp.vector cimport vector as std_vector
 from libcpp.utility cimport move
+from libcpp.memory cimport const_pointer_cast

Review comment:
       This doesn't seem used?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r628072354



##########
File path: cpp/src/arrow/adapters/orc/adapter_test.cc
##########
@@ -289,6 +289,8 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   constexpr uint64_t reader_batch_size = 1024;
 
   auto writer = CreateWriter(stripe_size, *type, &mem_stream);
+  // writer->addUserMetadata("key1", "value1");
+  // writer->addUserMetadata("key2", "value2");

Review comment:
       Sorry I should have gone over the code to remove all of them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10157:
URL: https://github.com/apache/arrow/pull/10157


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r628071675



##########
File path: cpp/src/arrow/adapters/orc/adapter.h
##########
@@ -149,7 +156,7 @@ class ARROW_EXPORT ORCFileWriter {
   /// \brief Creates a new ORC writer.
   ///
   /// \param[in] output_stream a pointer to the io::OutputStream to write into
-  /// \return the returned writer object
+  /// \return the returned writer object or Status if error occurs

Review comment:
       Sure.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-826478175


   https://issues.apache.org/jira/browse/ARROW-9299


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r628133835



##########
File path: python/pyarrow/tests/test_orc.py
##########
@@ -110,20 +114,20 @@ def check_example_file(orc_path, expected_df, need_fix=False):
     json_pos = 0
     for i in range(orc_file.nstripes):
         batch = orc_file.read_stripe(i)
-        check_example_values(pd.DataFrame(batch.to_pydict()),
-                             expected_df,
-                             start=json_pos,
-                             stop=json_pos + len(batch))
+        check_example_values(
+            pd.DataFrame(batch.to_pydict()),
+            expected_df,
+            start=json_pos,
+            stop=json_pos + len(batch),
+        )

Review comment:
       Sure!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r628071458



##########
File path: cpp/src/arrow/adapters/orc/adapter.h
##########
@@ -27,6 +27,7 @@
 #include "arrow/status.h"
 #include "arrow/type.h"
 #include "arrow/type_fwd.h"
+#include "arrow/util/key_value_metadata.h"

Review comment:
       Ah I see..




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-834501015


   @pitrou Done! Please review again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-826478484


   @pitrou Please review. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r631553536



##########
File path: python/pyarrow/orc.py
##########
@@ -26,11 +26,13 @@
 
 
 def _is_map(typ):
-    return (types.is_list(typ) and
-            types.is_struct(typ.value_type) and
-            typ.value_type.num_fields == 2 and
-            typ.value_type[0].name == 'key' and
-            typ.value_type[1].name == 'value')
+    return (
+        types.is_list(typ) and
+        types.is_struct(typ.value_type) and
+        typ.value_type.num_fields == 2 and
+        typ.value_type[0].name == "key" and
+        typ.value_type[1].name == "value"
+    )

Review comment:
       Sure..




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r628071066



##########
File path: cpp/src/arrow/adapters/orc/adapter.cc
##########
@@ -229,6 +229,17 @@ class ORCFileReader::Impl {
     return GetArrowSchema(type, out);
   }
 
+  Result<std::shared_ptr<KeyValueMetadata>> ReadMetadata() {
+    std::list<std::string> keys = reader_->getMetadataKeys();
+    auto metadata = std::make_shared<KeyValueMetadata>();
+    if (!keys.empty()) {

Review comment:
       Sure!

##########
File path: cpp/src/arrow/adapters/orc/adapter.cc
##########
@@ -229,6 +229,17 @@ class ORCFileReader::Impl {
     return GetArrowSchema(type, out);
   }
 
+  Result<std::shared_ptr<KeyValueMetadata>> ReadMetadata() {
+    std::list<std::string> keys = reader_->getMetadataKeys();

Review comment:
       Yup.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mathyingzhou commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
mathyingzhou commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-826478484


   @pitrou Please review. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#issuecomment-826478175


   https://issues.apache.org/jira/browse/ARROW-9299


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10157: ARROW-9299: [C++][Python] Expose ORC metadata

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10157:
URL: https://github.com/apache/arrow/pull/10157#discussion_r621398746



##########
File path: cpp/src/arrow/adapters/orc/adapter.cc
##########
@@ -229,6 +229,17 @@ class ORCFileReader::Impl {
     return GetArrowSchema(type, out);
   }
 
+  Result<std::shared_ptr<KeyValueMetadata>> ReadMetadata() {
+    std::list<std::string> keys = reader_->getMetadataKeys();

Review comment:
       Make this `const`?

##########
File path: cpp/src/arrow/adapters/orc/adapter.cc
##########
@@ -229,6 +229,17 @@ class ORCFileReader::Impl {
     return GetArrowSchema(type, out);
   }
 
+  Result<std::shared_ptr<KeyValueMetadata>> ReadMetadata() {
+    std::list<std::string> keys = reader_->getMetadataKeys();
+    auto metadata = std::make_shared<KeyValueMetadata>();
+    if (!keys.empty()) {

Review comment:
       Why not `for (const auto& key : keys) {`?

##########
File path: cpp/src/arrow/adapters/orc/adapter.h
##########
@@ -27,6 +27,7 @@
 #include "arrow/status.h"
 #include "arrow/type.h"
 #include "arrow/type_fwd.h"
+#include "arrow/util/key_value_metadata.h"

Review comment:
       No need for this a priori, as `KeyValueMetadata` is already declared in `arrow/type_fwd.h`.

##########
File path: python/pyarrow/tests/test_orc.py
##########
@@ -110,20 +114,20 @@ def check_example_file(orc_path, expected_df, need_fix=False):
     json_pos = 0
     for i in range(orc_file.nstripes):
         batch = orc_file.read_stripe(i)
-        check_example_values(pd.DataFrame(batch.to_pydict()),
-                             expected_df,
-                             start=json_pos,
-                             stop=json_pos + len(batch))
+        check_example_values(
+            pd.DataFrame(batch.to_pydict()),
+            expected_df,
+            start=json_pos,
+            stop=json_pos + len(batch),
+        )

Review comment:
       Please avoid doing unrelated style changes.
   Also, if you want an automated style pass, you can use `archery lint --python --fix`
   (see https://arrow.apache.org/docs/developers/archery.html)

##########
File path: cpp/src/arrow/adapters/orc/adapter_test.cc
##########
@@ -289,6 +289,8 @@ TEST(TestAdapterRead, ReadIntAndStringFileMultipleStripes) {
   constexpr uint64_t reader_batch_size = 1024;
 
   auto writer = CreateWriter(stripe_size, *type, &mem_stream);
+  // writer->addUserMetadata("key1", "value1");
+  // writer->addUserMetadata("key2", "value2");

Review comment:
       Why the commented out code?

##########
File path: cpp/src/arrow/adapters/orc/adapter.h
##########
@@ -149,7 +156,7 @@ class ARROW_EXPORT ORCFileWriter {
   /// \brief Creates a new ORC writer.
   ///
   /// \param[in] output_stream a pointer to the io::OutputStream to write into
-  /// \return the returned writer object
+  /// \return the returned writer object or Status if error occurs

Review comment:
       There is no need to state this explicitly, as `Result<>` can implicitly convey an error.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org