Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/04 17:43:54 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12765: ARROW-15523: [Python] Support for Datasets as inputs of Joins

jorisvandenbossche commented on code in PR #12765:
URL: https://github.com/apache/arrow/pull/12765#discussion_r841978543


##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -61,10 +65,18 @@ cdef execplan(inputs, output_type, vector[CDeclaration] plan, c_bool use_threads
         CTable* c_table
         shared_ptr[CTable] c_out_table
         shared_ptr[CSourceNodeOptions] c_sourceopts
+        shared_ptr[CScanNodeOptions] c_scanopts
+        shared_ptr[CExecNodeOptions] c_input_node_opts
         shared_ptr[CSinkNodeOptions] c_sinkopts
         shared_ptr[CAsyncExecBatchGenerator] c_async_exec_batch_gen
         shared_ptr[CRecordBatchReader] c_recordbatchreader
         vector[CDeclaration].iterator plan_iter
+        vector[CDeclaration.Input] no_c_inputs

Review Comment:
   What does the `no_` prefix stand for? (And what's the reason it was added? It doesn't seem to be actually used, other than being passed around uninitialized.)
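
For context, a hedged reading of the declaration in question (`no_` presumably meaning "no inputs"): the vector stays empty on purpose, so that a `CDeclaration` overload taking an explicit inputs argument can be called for leaf nodes. A sketch, with `c_decl_source` as a hypothetical name:

```cython
# Presumed intent: an always-empty inputs vector, passed to the
# CDeclaration overload that requires one, marking the declared node
# as a leaf with no upstream ExecNode inputs.
cdef vector[CDeclaration.Input] no_c_inputs
c_decl_source = CDeclaration(tobytes("source"), no_c_inputs, c_input_node_opts)
```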



##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -61,10 +65,18 @@ cdef execplan(inputs, output_type, vector[CDeclaration] plan, c_bool use_threads
         CTable* c_table
         shared_ptr[CTable] c_out_table
         shared_ptr[CSourceNodeOptions] c_sourceopts
+        shared_ptr[CScanNodeOptions] c_scanopts
+        shared_ptr[CExecNodeOptions] c_input_node_opts
         shared_ptr[CSinkNodeOptions] c_sinkopts
         shared_ptr[CAsyncExecBatchGenerator] c_async_exec_batch_gen
         shared_ptr[CRecordBatchReader] c_recordbatchreader
         vector[CDeclaration].iterator plan_iter
+        vector[CDeclaration.Input] no_c_inputs
+
+    global _dataset_support_initialised
+    if not _dataset_support_initialised:
+        Initialize()  # Initialise support for Datasets in ExecPlan
+        _dataset_support_initialised = True

Review Comment:
   Is there a reason not to do this at the module level? Is it costly in case the user never ends up using datasets?
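
For reference, a sketch of the trade-off the comment raises, with `_ensure_dataset_support` as a hypothetical helper name; the module-level alternative would instead call `Initialize()` once at import time, paying its (presumably small) cost even when datasets are never used:

```cython
_dataset_support_initialised = False

cdef _ensure_dataset_support():
    # Lazy variant, as in the diff: the one-time registration only
    # runs when an ExecPlan is actually built.
    global _dataset_support_initialised
    if not _dataset_support_initialised:
        Initialize()  # presumably registers the dataset ExecNode factories
        _dataset_support_initialised = True
```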



##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -81,21 +93,31 @@ cdef execplan(inputs, output_type, vector[CDeclaration] plan, c_bool use_threads
     # Create source nodes for each input
     for ipt in inputs:
         if isinstance(ipt, Table):
+            node_factory = "source"
             c_in_table = pyarrow_unwrap_table(ipt).get()
             c_sourceopts = GetResultValue(
                 CSourceNodeOptions.FromTable(deref(c_in_table), deref(c_exec_context).executor()))
+            c_input_node_opts.swap(deref(<shared_ptr[CExecNodeOptions]*>&c_sourceopts))

Review Comment:
   We have a `static_pointer_cast` Cython helper that might be useful here? (Not sure whether that would actually make the code simpler, though, if it works.)
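
For what it's worth, a sketch of what that could look like, assuming the `static_pointer_cast[T, U]` declaration from `pyarrow/includes/common.pxd` is available here:

```cython
# Upcast shared_ptr[CSourceNodeOptions] -> shared_ptr[CExecNodeOptions]
# directly, instead of swapping through a reinterpreted pointer.
c_input_node_opts = static_pointer_cast[CExecNodeOptions, CSourceNodeOptions](
    c_sourceopts)
```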



##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -176,6 +199,57 @@ def tables_join(join_type, left_table not None, left_keys,
     -------
     result_table : Table
     """
+    return _perform_join(join_type, left_table, left_keys,
+                         right_table, right_keys, left_suffix,
+                         right_suffix, use_threads, coalesce_keys)
+
+
+def datasets_join(join_type, left_dataset not None, left_keys,

Review Comment:
   Given that those functions are not public, does it matter much to have both `datasets_join` and `tables_join`? Right now they are exactly the same under the hood, so you can actually pass a dataset to the tables version and vice versa.
   
   (But also because they're not public, it doesn't matter that much for now, if you have plans to make more changes to them. :))
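
To make that concrete, a sketch (with the hypothetical name `_join_any`): since both wrappers delegate to `_perform_join` with identical arguments, a single non-public entry point would behave the same for Tables and Datasets:

```cython
def _join_any(join_type, left, left_keys, right, right_keys,
              left_suffix, right_suffix, use_threads, coalesce_keys):
    # Table and Dataset inputs take the same path; execplan()
    # dispatches on the concrete input type.
    return _perform_join(join_type, left, left_keys,
                         right, right_keys, left_suffix,
                         right_suffix, use_threads, coalesce_keys)
```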



##########
python/pyarrow/lib.pyx:
##########
@@ -120,6 +120,8 @@ UnionMode_DENSE = _UnionMode_DENSE
 
 def _pc():
     import pyarrow.compute as pc
+    from pyarrow import _exec_plan
+    pc._exec_plan = _exec_plan

Review Comment:
   In theory, pyarrow.dataset is not necessarily enabled/built; wouldn't that mean that all pyarrow.compute-based methods would then fail as well?
   (Although I am not sure whether we still have (nightly) builds that exercise this option...)
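
One possible mitigation, sketched under the assumption that importing `_exec_plan` raises ImportError when the dataset component isn't built, and that `_pc()` otherwise just returns the compute module:

```cython
def _pc():
    import pyarrow.compute as pc
    try:
        # _exec_plan depends on the optional dataset component; a missing
        # build should not break every pyarrow.compute-based method.
        from pyarrow import _exec_plan
        pc._exec_plan = _exec_plan
    except ImportError:
        pc._exec_plan = None
    return pc
```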



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org