You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/27 01:16:11 UTC

[GitHub] [arrow] amoeba opened a new pull request, #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

amoeba opened a new pull request, #13983:
URL: https://github.com/apache/arrow/pull/13983

   Adds an additional numypdoc check to CI (PR03) and fixes all corresponding violations. 
   
   This includes one minor breaking change to parameter order in `pyarrow.parquet.read_table` which makes order more uniform across `ParquetDataset.__new__`, `read_table`, and `read_pandas` which have related signatures.
   
   Note this does not fully resolve [ARROW-15006](https://issues.apache.org/jira/browse/ARROW-15006).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche merged PR #13983:
URL: https://github.com/apache/arrow/pull/13983


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1229086799

   https://issues.apache.org/jira/browse/ARROW-15006


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1230935663

   Thanks for reviewing this @jorisvandenbossche. I took your suggestion on keeping `skip_rows` and `skip_rows_after_names` together. I re-ran `archery numpydoc --allow-rule PR03` which still passes. This is ready for another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r956517419


##########
python/pyarrow/parquet/core.py:
##########
@@ -2782,9 +2782,9 @@ def partitioning(self):
 
 
 def read_table(source, *, columns=None, use_threads=True, metadata=None,
-               schema=None, use_pandas_metadata=False, memory_map=False,
-               read_dictionary=None, filesystem=None, filters=None,
-               buffer_size=0, partitioning="hive", use_legacy_dataset=False,
+               schema=None, use_pandas_metadata=False, read_dictionary=None,
+               memory_map=False, buffer_size=0, partitioning="hive",
+               filesystem=None, filters=None, use_legacy_dataset=False,

Review Comment:
   This seems breaking to me, though I'm not sure if it's of much concern. Feedback appreciated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r971176354


##########
python/pyarrow/_dataset.pyx:
##########
@@ -2299,6 +2299,16 @@ cdef class Scanner(_Weakrefable):
         ----------
         dataset : Dataset
             Dataset to scan.
+        use_threads : bool, default True

Review Comment:
   I pushed a change to address this in https://github.com/apache/arrow/pull/13983/commits/d04b2283ecee55cd1be02d2092a7b06f54b94514 that:
   
   - Re-orders the signatures and docstrings inside the Scanner class to be consistent and follow what I think is a sensible order
   - Makes `Scanner.from_dataset`, `Scanner.from_fragment`, and `Scanner.from_batches` use the above pattern to make all args except the first keyword only
   
   Waiting on CI.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r956517419


##########
python/pyarrow/parquet/core.py:
##########
@@ -2782,9 +2782,9 @@ def partitioning(self):
 
 
 def read_table(source, *, columns=None, use_threads=True, metadata=None,
-               schema=None, use_pandas_metadata=False, memory_map=False,
-               read_dictionary=None, filesystem=None, filters=None,
-               buffer_size=0, partitioning="hive", use_legacy_dataset=False,
+               schema=None, use_pandas_metadata=False, read_dictionary=None,
+               memory_map=False, buffer_size=0, partitioning="hive",
+               filesystem=None, filters=None, use_legacy_dataset=False,

Review Comment:
   This seems breaking to me, though I'm not sure if it's of much concern. Feedback appreciated and happy to go another direction in order to get the docstring order right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r956517419


##########
python/pyarrow/parquet/core.py:
##########
@@ -2782,9 +2782,9 @@ def partitioning(self):
 
 
 def read_table(source, *, columns=None, use_threads=True, metadata=None,
-               schema=None, use_pandas_metadata=False, memory_map=False,
-               read_dictionary=None, filesystem=None, filters=None,
-               buffer_size=0, partitioning="hive", use_legacy_dataset=False,
+               schema=None, use_pandas_metadata=False, read_dictionary=None,
+               memory_map=False, buffer_size=0, partitioning="hive",
+               filesystem=None, filters=None, use_legacy_dataset=False,

Review Comment:
   I consider this a breaking change, though maybe on a minor level. Feedback appreciated and happy to go another direction in order to get the docstring order right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1229086814

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r999439418


##########
python/pyarrow/_dataset.pyx:
##########
@@ -2484,9 +2482,11 @@ cdef class Scanner(_Weakrefable):
             record batches are overflowing memory then this method can be
             called to reduce their size.
         batch_readahead : int, default 16
-            The number of batches to read ahead in a file. This might not work
-            for all file formats. Increasing this number will increase

Review Comment:
   Same here



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2397,12 +2396,14 @@ cdef class Scanner(_Weakrefable):
             record batches are overflowing memory then this method can be
             called to reduce their size.
         batch_readahead : int, default 16
-            The number of batches to read ahead in a file. This might not work
-            for all file formats. Increasing this number will increase

Review Comment:
   It seems you (accidentally?) lost the "This might not work for all file formats" ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r999944630


##########
python/pyarrow/_dataset.pyx:
##########
@@ -2397,12 +2396,14 @@ cdef class Scanner(_Weakrefable):
             record batches are overflowing memory then this method can be
             called to reduce their size.
         batch_readahead : int, default 16
-            The number of batches to read ahead in a file. This might not work
-            for all file formats. Increasing this number will increase

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r999944404


##########
python/pyarrow/_dataset.pyx:
##########
@@ -2484,9 +2482,11 @@ cdef class Scanner(_Weakrefable):
             record batches are overflowing memory then this method can be
             called to reduce their size.
         batch_readahead : int, default 16
-            The number of batches to read ahead in a file. This might not work
-            for all file formats. Increasing this number will increase

Review Comment:
   Good catch. When I was in there fixing this, I found one more instance that wasn't consistent. All three instances of this docstring now have the same text,
   
   > The number of batches to read ahead in a file. This might not work
           for all file formats. Increasing this number will increase
           RAM usage but could also improve IO utilization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r963779801


##########
python/pyarrow/_dataset.pyx:
##########
@@ -2299,6 +2299,16 @@ cdef class Scanner(_Weakrefable):
         ----------
         dataset : Dataset
             Dataset to scan.
+        use_threads : bool, default True

Review Comment:
   For this one, I would maybe also change the order of the signature instead, and make all keywords (except the first) keyword-only, using `, *,`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1275461955

   Sorry for the noise, I was cleaning up my fork and wiped out the commits. They've since been restored and this is ready for review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r957096533


##########
python/pyarrow/_csv.pyx:
##########
@@ -104,14 +104,6 @@ cdef class ReadOptions(_Weakrefable):
     skip_rows : int, optional (default 0)
         The number of rows to skip before the column names (if any)
         and the CSV data.
-    skip_rows_after_names : int, optional (default 0)

Review Comment:
   Maybe we could rather change the order in the actual signature, since it seems useful to have skip_rows and skip_rows_after_names together?



##########
python/pyarrow/parquet/core.py:
##########
@@ -2782,9 +2782,9 @@ def partitioning(self):
 
 
 def read_table(source, *, columns=None, use_threads=True, metadata=None,
-               schema=None, use_pandas_metadata=False, memory_map=False,
-               read_dictionary=None, filesystem=None, filters=None,
-               buffer_size=0, partitioning="hive", use_legacy_dataset=False,
+               schema=None, use_pandas_metadata=False, read_dictionary=None,
+               memory_map=False, buffer_size=0, partitioning="hive",
+               filesystem=None, filters=None, use_legacy_dataset=False,

Review Comment:
   This shouldn't be a deprecating change, since the parameters here are keyword-argument-only (only the first `source` argument is allowed as positional argument, due to the `*` in the signature)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #13983:
URL: https://github.com/apache/arrow/pull/13983#discussion_r957842230


##########
python/pyarrow/_csv.pyx:
##########
@@ -104,14 +104,6 @@ cdef class ReadOptions(_Weakrefable):
     skip_rows : int, optional (default 0)
         The number of rows to skip before the column names (if any)
         and the CSV data.
-    skip_rows_after_names : int, optional (default 0)

Review Comment:
   That makes sense. And other places in [pyarrow](https://github.com/apache/arrow/blob/7035def5ba77218c287dccbd84ae299743379fcb/docs/source/python/csv.rst) and [libarrow](https://github.com/apache/arrow/blob/7035def5ba77218c287dccbd84ae299743379fcb/cpp/src/arrow/csv/options.h#L151-L155) that reference both have them grouped together. Changed in https://github.com/apache/arrow/pull/13983/commits/44acfb34e5efad81726b5852da7cf13621065097. Thanks.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1247258732

   Thanks @jorisvandenbossche, the conflict should now be resolved. I made an attempt at addressing [your comment](https://github.com/apache/arrow/pull/13983#discussion_r963779801) so let me know if that's what you were thinking. This is ready for another look when you have time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba closed pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba closed pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03
URL: https://github.com/apache/arrow/pull/13983


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1284588549

   Thanks for the review @jorisvandenbossche. I didn't mean to remove that bit and I'm not sure how I did. Good catch! All three instances of that param docstring now have the expected text.
   
   Ready for review/merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #13983: ARROW-15006: [Python][CI][Doc] Enable numpydoc check PR03

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #13983:
URL: https://github.com/apache/arrow/pull/13983#issuecomment-1261426171

   Ping @jorisvandenbossche: Any chance you have a spare moment to take a look at this again? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org