You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/29 16:39:30 UTC

[GitHub] [arrow] zhztheplayer opened a new pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

zhztheplayer opened a new pull request #10201:
URL: https://github.com/apache/arrow/pull/10201


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r694442690



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       Hello @kiszk , this part of code (Cross JNI data sharing) was moved to https://github.com/apache/arrow/pull/10883 as a dependency of this PR, would you like to have a look? Thank a lot. I have added a separated style fix commit in that one. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-893348819


   Part of this work around ARROW-7272 has been moved to another PR https://github.com/apache/arrow/pull/10883.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-893348819


   Part of this work around ARROW-7272 has been moved to another PR https://github.com/apache/arrow/pull/10883.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-829496599


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r637363012



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       Sorry about that. :( Will make split changes next time.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r637362031



##########
File path: java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java
##########
@@ -52,7 +52,7 @@ private JniWrapper() {
    * Create Dataset from a DatasetFactory and get the native pointer of the Dataset.
    *
    * @param datasetFactoryId the native pointer of the arrow::dataset::DatasetFactory instance.
-   * @param schema the predefined schema of the resulting Dataset.
+   * @param schema           the predefined schema of the resulting Dataset.

Review comment:
       Thanks! But I don't see we always make consistent practices of Javadoc alignment in Arrow Java code. E.g.
   
   https://github.com/apache/arrow/blob/78c88a9f517b540ea010a6bd6a8c8f6d91769559/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/JniWrapper.java#L34-L40
   
   Maybe what we need here is to add a relevant rule to checkstyle config and do a complete clean up?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r694165098



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       I agree with @emkornfield. To split these types of changes into another PR make us easier to review this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r632975139



##########
File path: java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java
##########
@@ -52,7 +52,7 @@ private JniWrapper() {
    * Create Dataset from a DatasetFactory and get the native pointer of the Dataset.
    *
    * @param datasetFactoryId the native pointer of the arrow::dataset::DatasetFactory instance.
-   * @param schema the predefined schema of the resulting Dataset.
+   * @param schema           the predefined schema of the resulting Dataset.

Review comment:
       nit: spaces are not necessary




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r635787880



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;

Review comment:
       nit: can you prefix org with "::"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r637398013



##########
File path: cpp/src/jni/dataset/jni_wrapper.cc
##########
@@ -166,6 +156,126 @@ class DisposableScannerAdaptor {
   }
 };
 
+/// \brief Simple scan task implementation that is constructed directly
+/// from a record batch iterator (and its corresponding fragment).
+class SimpleIteratorTask : public arrow::dataset::ScanTask {

Review comment:
       Fixed in https://github.com/apache/arrow/pull/10201/commits/71cba257cfd24cd3b37b00f8e478cbceab8bc440. Thanks for adding the facility.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r694443806



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       Or you think we should have a individual PR rather than a commit to fix the style stuffs? I am okay to either way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r694165098



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       I agree with @emkornfield. To split these types of changes into another PR make us easier to review this PR at this time.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r623237840



##########
File path: cpp/src/jni/dataset/proto/record_batch.proto
##########
@@ -1,55 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-syntax = "proto2";
-package arrow.dataset.jni;
-
-option java_package = "org.apache.arrow.dataset.jni";
-option java_outer_classname = "RecordBatchProtos";

Review comment:
       @emkornfield
   I managed to migrate to existing flatbuffers API from protobuf.
   
   previous comment https://github.com/apache/arrow/pull/10108#discussion_r619887946




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer edited a comment on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer edited a comment on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-869494651


   Hi @emkornfield, do you have any other comments on the current code? Sorry for pinging you again as I would like to see if we can make this into 5.0.0.
   
   To ease review, the changes was split into two parts:
   - Refactoring to allow passing native java buffers to C++ code:
   https://github.com/apache/arrow/pull/10201/commits/4a0ddbec32c58e079dcc5973620bd402bd1456fe (the first commit)
   - Allowing writes:
   Other commits. [diff](https://github.com/zhztheplayer/arrow-1/compare/4a0ddbec32c58e079dcc5973620bd402bd1456fe..cc13ac282f3f4ec1fdfa010b52795f74f07e2057)
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r637397245



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -162,31 +192,40 @@ std::shared_ptr<ReservationListener> ReservationListenableMemoryPool::get_listen
 
 ReservationListenableMemoryPool::~ReservationListenableMemoryPool() {}
 
+Status CheckException(JNIEnv* env) {
+  if (env->ExceptionCheck()) {
+    env->ExceptionDescribe();
+    env->ExceptionClear();
+    return Status::Invalid("Error during calling Java code from native code");

Review comment:
       Fixed in commit 922d6dd143c12984daa788d9d737a75914e8a678.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r637362891



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;

Review comment:
       Thanks but did you mean to use `namespace flatbuf = ::org::apache::arrow::flatbuf;`? I did a search in our C++ code base using RegEx `namespace \w* = `, and it seems to be a common practice not adding the leading `::`
   ```sh
   [root@localhost arrow]# grep -R "namespace [a-z A-Z][a-z A-Z]* =.*" cpp/src/
   cpp/src/parquet/column_reader.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/column_writer.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/encoding_test.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/arrow/reader.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/arrow/reader_internal.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/encoding.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/column_writer_test.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/parquet/statistics_test.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/jni/dataset/jni_util.cc:namespace flatbuf = org::apache::arrow::flatbuf;
   cpp/src/arrow/ipc/json_simple.cc:namespace rj = arrow::rapidjson;
   cpp/src/arrow/ipc/metadata_internal.cc:namespace flatbuf = org::apache::arrow::flatbuf;
   cpp/src/arrow/ipc/reader.cc:namespace flatbuf = org::apache::arrow::flatbuf;
   cpp/src/arrow/ipc/metadata_internal.h:namespace flatbuf = org::apache::arrow::flatbuf;
   cpp/src/arrow/compute/kernels/aggregate_benchmark.cc:namespace BitUtil = arrow::BitUtil;
   cpp/src/arrow/filesystem/s3_test_util.h:namespace bp = boost::process;
   cpp/src/arrow/filesystem/s3fs_test.cc:namespace bp = boost::process;
   cpp/src/arrow/gpu/cuda_arrow_ipc.cc:namespace flatbuf = org::apache::arrow::flatbuf;
   cpp/src/arrow/flight/flight_benchmark.cc:namespace perf = arrow::flight::perf;
   cpp/src/arrow/flight/test_util.cc:namespace bp = boost::process;
   cpp/src/arrow/flight/test_util.cc:namespace fs = boost::filesystem;
   cpp/src/arrow/flight/test_util.cc:  namespace fs = boost::filesystem;
   cpp/src/arrow/flight/internal.h:namespace pb = arrow::flight::protocol;
   cpp/src/arrow/flight/server.cc:namespace pb = arrow::flight::protocol;
   cpp/src/arrow/flight/serialization_internal.cc:namespace pb = arrow::flight::protocol;
   cpp/src/arrow/flight/client.cc:namespace pb = arrow::flight::protocol;
   cpp/src/arrow/flight/client.cc:          namespace ge = GRPC_NAMESPACE_FOR_TLS_CREDENTIALS_OPTIONS;
   cpp/src/arrow/flight/perf_server.cc:namespace perf = arrow::flight::perf;
   cpp/src/arrow/flight/perf_server.cc:namespace proto = arrow::flight::protocol;
   cpp/src/arrow/flight/flight_test.cc:namespace pb = arrow::flight::protocol;
   cpp/src/arrow/adapters/orc/adapter.cc:namespace liborc = orc;
   cpp/src/arrow/adapters/orc/adapter_util.h:namespace liborc = orc;
   cpp/src/arrow/adapters/orc/adapter_test.cc:namespace liborc = orc;
   cpp/src/arrow/adapters/orc/adapter_util.cc:namespace liborc = orc;
   cpp/src/arrow/json/object_parser.cc:namespace rj = arrow::rapidjson;
   cpp/src/arrow/json/test_common.h:namespace rj = arrow::rapidjson;
   cpp/src/arrow/json/chunker.cc:namespace rj = arrow::rapidjson;
   cpp/src/arrow/json/parser.cc:namespace rj = arrow::rapidjson;
   cpp/src/arrow/json/object_writer.cc:namespace rj = arrow::rapidjson;
   cpp/src/arrow/testing/json_internal.h:namespace rj = arrow::rapidjson;
   cpp/src/plasma/store.cc:namespace fb = plasma::flatbuf;
   cpp/src/plasma/client.cc:namespace fb = plasma::flatbuf;
   cpp/src/plasma/common.cc:namespace fb = plasma::flatbuf;
   cpp/src/plasma/plasma.cc:namespace fb = plasma::flatbuf;
   cpp/src/plasma/test/serialization_tests.cc:namespace fb = plasma::flatbuf;
   cpp/src/plasma/protocol.cc:namespace fb = plasma::flatbuf;
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10201: WIP: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-904273483


   https://issues.apache.org/jira/browse/ARROW-11776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on pull request #10201: WIP: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-904273513


   Marked title as WIP to let https://github.com/apache/arrow/pull/10883 to be reviewed first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r636040540



##########
File path: cpp/src/jni/dataset/jni_wrapper.cc
##########
@@ -166,6 +156,126 @@ class DisposableScannerAdaptor {
   }
 };
 
+/// \brief Simple scan task implementation that is constructed directly
+/// from a record batch iterator (and its corresponding fragment).
+class SimpleIteratorTask : public arrow::dataset::ScanTask {

Review comment:
       ScanTask is going away. However, from a quick glance, what we want here is a Scanner from a RecordBatchIterator. Use the existing Scanner::FromRecordBatchReader:
   
   https://github.com/apache/arrow/blob/8b0e04961790003abb32ef03e19a4604060e88ec/cpp/src/arrow/dataset/scanner.h#L323-L329




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r635789662



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -162,31 +192,40 @@ std::shared_ptr<ReservationListener> ReservationListenableMemoryPool::get_listen
 
 ReservationListenableMemoryPool::~ReservationListenableMemoryPool() {}
 
+Status CheckException(JNIEnv* env) {
+  if (env->ExceptionCheck()) {
+    env->ExceptionDescribe();
+    env->ExceptionClear();
+    return Status::Invalid("Error during calling Java code from native code");

Review comment:
       including the message with Invalid could aid users.  Status's have a member for [arbitrary details](https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/status.h#L101).  it might pay to create a new extension of this class that can keep a reference to the exception?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-846405682


   @emkornfield 
   
   > Refactoring to allow passing native java buffers to C++ code. For this, another option that might be simpler/more reliable is to try to use the Arrow C ABI to translate the data across java and C++. Did you consider this?
   
   Once we had a short discussion around this https://issues.apache.org/jira/browse/ARROW-7272?focusedCommentId=16983849&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983849
   
   C data interface does sound to be better. Although I believe a Java implementation will require a design that probably tightly relies on JNI which makes the implementation slightly more complex than in other language's case. I think we can open a separate JIRA ticket for that to do the work in future as we don't yet have the implementation in Java. So far I have extracted the [commit](https://github.com/apache/arrow/pull/10201/commits/03e3c2c924f1f4f21a66b5c4cec1d4ab2d3fee18) that refactored current code within flatbuffers and marked it to resolve the issue ARROW-7272 (if it's OK I may open an individual PR within the commit for 7272). What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r635791186



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -211,32 +250,183 @@ std::vector<std::string> ToStringVector(JNIEnv* env, jobjectArray& str_array) {
   return vector;
 }
 
-arrow::Result<jbyteArray> ToSchemaByteArray(JNIEnv* env,
-                                            std::shared_ptr<arrow::Schema> schema) {
-  ARROW_ASSIGN_OR_RAISE(
-      std::shared_ptr<arrow::Buffer> buffer,
-      arrow::ipc::SerializeSchema(*schema, arrow::default_memory_pool()))
+Result<jbyteArray> ToSchemaByteArray(JNIEnv* env, std::shared_ptr<Schema> schema) {
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Buffer> buffer,
+                        ipc::SerializeSchema(*schema, default_memory_pool()))
 
   jbyteArray out = env->NewByteArray(buffer->size());
   auto src = reinterpret_cast<const jbyte*>(buffer->data());
   env->SetByteArrayRegion(out, 0, buffer->size(), src);
   return out;
 }
 
-arrow::Result<std::shared_ptr<arrow::Schema>> FromSchemaByteArray(
-    JNIEnv* env, jbyteArray schemaBytes) {
-  arrow::ipc::DictionaryMemo in_memo;
+Result<std::shared_ptr<Schema>> FromSchemaByteArray(JNIEnv* env, jbyteArray schemaBytes) {
+  ipc::DictionaryMemo in_memo;
   int schemaBytes_len = env->GetArrayLength(schemaBytes);
   jbyte* schemaBytes_data = env->GetByteArrayElements(schemaBytes, nullptr);
-  auto serialized_schema = std::make_shared<arrow::Buffer>(
+  auto serialized_schema = std::make_shared<Buffer>(
       reinterpret_cast<uint8_t*>(schemaBytes_data), schemaBytes_len);
-  arrow::io::BufferReader buf_reader(serialized_schema);
-  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Schema> schema,
-                        arrow::ipc::ReadSchema(&buf_reader, &in_memo))
+  io::BufferReader buf_reader(serialized_schema);
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Schema> schema,
+                        ipc::ReadSchema(&buf_reader, &in_memo))
   env->ReleaseByteArrayElements(schemaBytes, schemaBytes_data, JNI_ABORT);
   return schema;
 }
 
+Status SetMetadataForSingleField(std::shared_ptr<ArrayData> array_data,
+                                 std::vector<ipc::internal::FieldMetadata>& nodes_meta,
+                                 std::vector<ipc::internal::BufferMetadata>& buffers_meta,
+                                 std::shared_ptr<KeyValueMetadata>& custom_metadata) {
+  nodes_meta.push_back({array_data->length, array_data->null_count, 0L});
+
+  for (size_t i = 0; i < array_data->buffers.size(); i++) {
+    auto buffer = array_data->buffers.at(i);
+    uint8_t* data = nullptr;
+    int64_t size = 0;
+    if (buffer != nullptr) {
+      data = (uint8_t*)buffer->data();
+      size = buffer->size();
+    }
+    ipc::internal::BufferMetadata buffer_metadata{};
+    buffer_metadata.offset = reinterpret_cast<int64_t>(data);
+    buffer_metadata.length = size;
+    // store buffer refs into custom metadata
+    jlong ref = CreateNativeRef(buffer);
+    custom_metadata->Append(
+        "NATIVE_BUFFER_REF_" + std::to_string(i),
+        util::base64_encode(reinterpret_cast<unsigned char*>(&ref), sizeof(ref)));

Review comment:
       what is the performance overhead of this.  Could another approach be to use to record batches.  One that contains the data, and one that contains all the reference pointers?
   
   Or at the very least only one metadata entry and encode all the buffer references as some sort of flatbuffer, protobuf or json list?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-846406069


   JNI CI failure is related to an existing issue https://issues.apache.org/jira/browse/ARROW-12838


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r635788017



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       please try to avoid these types of cleanups with large functional changes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#issuecomment-869494651


   Hi @emkornfield, do you have any other comments on the current code? Sorry for pinging you again as I would like to see if we can make this into 5.0.0.
   
   To ease review, the changes was split into two parts:
   - Refactoring to allow passing native java buffers to C++ code:
   https://github.com/apache/arrow/pull/10201/commits/4a0ddbec32c58e079dcc5973620bd402bd1456fe (the first commit)
   - Allowing writes:
   Other commits. [diff](https://github.com/zhztheplayer/arrow-1/compare/4a0ddbec32c58e079dcc5973620bd402bd1456fe..cc13ac282f3f4ec1fdfa010b52795f74f07e2057#diff-9769218530f3dee1dad702087bce0c1dc4d31613168ed5f7ca296878c7c303a7R134)
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zhztheplayer commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r637397914



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -211,32 +250,183 @@ std::vector<std::string> ToStringVector(JNIEnv* env, jobjectArray& str_array) {
   return vector;
 }
 
-arrow::Result<jbyteArray> ToSchemaByteArray(JNIEnv* env,
-                                            std::shared_ptr<arrow::Schema> schema) {
-  ARROW_ASSIGN_OR_RAISE(
-      std::shared_ptr<arrow::Buffer> buffer,
-      arrow::ipc::SerializeSchema(*schema, arrow::default_memory_pool()))
+Result<jbyteArray> ToSchemaByteArray(JNIEnv* env, std::shared_ptr<Schema> schema) {
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Buffer> buffer,
+                        ipc::SerializeSchema(*schema, default_memory_pool()))
 
   jbyteArray out = env->NewByteArray(buffer->size());
   auto src = reinterpret_cast<const jbyte*>(buffer->data());
   env->SetByteArrayRegion(out, 0, buffer->size(), src);
   return out;
 }
 
-arrow::Result<std::shared_ptr<arrow::Schema>> FromSchemaByteArray(
-    JNIEnv* env, jbyteArray schemaBytes) {
-  arrow::ipc::DictionaryMemo in_memo;
+Result<std::shared_ptr<Schema>> FromSchemaByteArray(JNIEnv* env, jbyteArray schemaBytes) {
+  ipc::DictionaryMemo in_memo;
   int schemaBytes_len = env->GetArrayLength(schemaBytes);
   jbyte* schemaBytes_data = env->GetByteArrayElements(schemaBytes, nullptr);
-  auto serialized_schema = std::make_shared<arrow::Buffer>(
+  auto serialized_schema = std::make_shared<Buffer>(
       reinterpret_cast<uint8_t*>(schemaBytes_data), schemaBytes_len);
-  arrow::io::BufferReader buf_reader(serialized_schema);
-  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Schema> schema,
-                        arrow::ipc::ReadSchema(&buf_reader, &in_memo))
+  io::BufferReader buf_reader(serialized_schema);
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Schema> schema,
+                        ipc::ReadSchema(&buf_reader, &in_memo))
   env->ReleaseByteArrayElements(schemaBytes, schemaBytes_data, JNI_ABORT);
   return schema;
 }
 
+Status SetMetadataForSingleField(std::shared_ptr<ArrayData> array_data,
+                                 std::vector<ipc::internal::FieldMetadata>& nodes_meta,
+                                 std::vector<ipc::internal::BufferMetadata>& buffers_meta,
+                                 std::shared_ptr<KeyValueMetadata>& custom_metadata) {
+  nodes_meta.push_back({array_data->length, array_data->null_count, 0L});
+
+  for (size_t i = 0; i < array_data->buffers.size(); i++) {
+    auto buffer = array_data->buffers.at(i);
+    uint8_t* data = nullptr;
+    int64_t size = 0;
+    if (buffer != nullptr) {
+      data = (uint8_t*)buffer->data();
+      size = buffer->size();
+    }
+    ipc::internal::BufferMetadata buffer_metadata{};
+    buffer_metadata.offset = reinterpret_cast<int64_t>(data);
+    buffer_metadata.length = size;
+    // store buffer refs into custom metadata
+    jlong ref = CreateNativeRef(buffer);
+    custom_metadata->Append(
+        "NATIVE_BUFFER_REF_" + std::to_string(i),
+        util::base64_encode(reinterpret_cast<unsigned char*>(&ref), sizeof(ref)));

Review comment:
       Fixed commit 8ef07be7b42cc9a4e83326f6a785457d664678ee.
   
   Here I chose to encode the references within json arrays. It might be possible to use extra arrow arrays/recordbatch to store the reference pointers but we still have to manage the reference pointers to the ref arrays/recordbatch themselves. Let's have a rework on this within implementing C data interface for Java in future.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10201: WIP: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r694457267



##########
File path: cpp/src/jni/dataset/jni_util.cc
##########
@@ -16,36 +16,44 @@
 // under the License.
 
 #include "jni/dataset/jni_util.h"
-
+#include "arrow/ipc/metadata_internal.h"
+#include "arrow/util/base64.h"
+#include "arrow/util/key_value_metadata.h"
 #include "arrow/util/logging.h"
 
+#include <memory>
 #include <mutex>
 
+#include <flatbuffers/flatbuffers.h>
+
 namespace arrow {
+
+namespace flatbuf = org::apache::arrow::flatbuf;
+
 namespace dataset {
 namespace jni {
 
 class ReservationListenableMemoryPool::Impl {
  public:
-  explicit Impl(arrow::MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
+  explicit Impl(MemoryPool* pool, std::shared_ptr<ReservationListener> listener,
                 int64_t block_size)
       : pool_(pool),
         listener_(listener),
         block_size_(block_size),
         blocks_reserved_(0),
         bytes_reserved_(0) {}
 
-  arrow::Status Allocate(int64_t size, uint8_t** out) {

Review comment:
       Sure, let me take a look.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10201: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10201:
URL: https://github.com/apache/arrow/pull/10201#discussion_r635792044



##########
File path: cpp/src/jni/dataset/jni_wrapper.cc
##########
@@ -166,6 +156,126 @@ class DisposableScannerAdaptor {
   }
 };
 
+/// \brief Simple scan task implementation that is constructed directly
+/// from a record batch iterator (and its corresponding fragment).
+class SimpleIteratorTask : public arrow::dataset::ScanTask {

Review comment:
       CC @lidavidm @westonpace what is the current status of changes in cardinality for ScanTask?  is this a use that should be supported long term?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org