You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/03 13:14:32 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

jorisvandenbossche opened a new pull request #10224:
URL: https://github.com/apache/arrow/pull/10224


   We were already creating a Scanner to pass to the actual C++ `Write` method, so easy to also pass through a scanner if supplied by the user.
   
   (still need to add tests)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#issuecomment-831893790


   > Do you plan to raise the warning?
   
   So for now only added an error in case of specifying a Scanner (since that's a new case, we can directly raise an error if the schema is specified as well, no need to warn first). 
   And I thought to leave the other cases for a follow-up, as it is not strictly related to this PR, and I think we can first look into scanning/projecting a dataset to a given schema. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#issuecomment-831252095


   https://issues.apache.org/jira/browse/ARROW-12631


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#issuecomment-831921158


   Thanks for opening the issue!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm closed pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #10224:
URL: https://github.com/apache/arrow/pull/10224


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#discussion_r625176491



##########
File path: python/pyarrow/dataset.py
##########
@@ -733,19 +733,17 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
     """
     from pyarrow.fs import _resolve_filesystem_and_path
 
-    if isinstance(data, Dataset):
-        schema = schema or data.schema

Review comment:
       Yeah, either accepting a schema for a projection or having a separate schema keyword would be a nice convenience if all you want to do is select/cast columns.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#discussion_r625099784



##########
File path: python/pyarrow/dataset.py
##########
@@ -733,19 +733,17 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
     """
     from pyarrow.fs import _resolve_filesystem_and_path
 
-    if isinstance(data, Dataset):
-        schema = schema or data.schema

Review comment:
       > Hmm, would the user ever want to pass a schema to force a cast? Though I suppose that's redundant (either configure the scanner or pass a scanner instead of a dataset).
   
   In that case, I think the user could have created the dataset with a specific (forced) schema: `ds.dataset(..., schema=schema)`. 
   But once you have a dataset, it's not easy to change the schema. The `scanner` method doesn't take a `schema` keyword (although that's maybe something to add, to project to the specified schema). 
   
   Yes, if we would want to raise, we can indeed raise a warning first. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#discussion_r625078445



##########
File path: python/pyarrow/dataset.py
##########
@@ -733,19 +733,17 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
     """
     from pyarrow.fs import _resolve_filesystem_and_path
 
-    if isinstance(data, Dataset):
-        schema = schema or data.schema

Review comment:
       We are extracting the `schema` here, but this is not actually used any more, if the user passes a Dataset or Scanner. 
   
   This should be better documented, but to avoid surprises, could also error in that case? (or only if the schema's are not equal)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#discussion_r625082780



##########
File path: python/pyarrow/dataset.py
##########
@@ -733,19 +733,17 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
     """
     from pyarrow.fs import _resolve_filesystem_and_path
 
-    if isinstance(data, Dataset):
-        schema = schema or data.schema

Review comment:
       Hmm, would the user ever want to pass a schema to force a cast? Though I suppose that's redundant (either configure the scanner or pass a scanner instead of a dataset).
   
   I think a deprecation warning would be appropriate here (schema will only be allowed when passing an iterable in the future).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10224: ARROW-12631: [Python] Accept Scanner in pyarrow.dataset.write_dataset

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10224:
URL: https://github.com/apache/arrow/pull/10224#issuecomment-831913973


   Sounds good to me. I filed ARROW-12647. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org