Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/19 14:57:40 UTC

[GitHub] [arrow] AlenkaF opened a new pull request, #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

AlenkaF opened a new pull request, #13199:
URL: https://github.com/apache/arrow/pull/13199

   This PR adds `doctest` functionality to ensure that docstring examples are actually correct (and stay correct).
   
   - [ ] Add `--doctest-modules`
   - [ ] Add `--doctest-cython`
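
   As a rough illustration (hypothetical function, not from this PR), `--doctest-modules` makes pytest collect the docstrings in the .py modules and check that each `>>>` example produces exactly the output printed below it:

   ```python
   def add(a, b):
       """Add two numbers.

       Examples
       --------
       >>> add(2, 3)
       5
       """
       return a + b
   ```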


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r881543095


##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pytest
+from pyarrow import Codec
+
+groups = [
+    'brotli',
+    'bz2',
+    'cython',
+    'dataset',
+    'hypothesis',
+    'fastparquet',
+    'gandiva',
+    'gdb',
+    'gzip',
+    'hdfs',
+    'large_memory',
+    'lz4',
+    'memory_leak',
+    'nopandas',
+    'orc',
+    'pandas',
+    'parquet',
+    'parquet_encryption',
+    'plasma',
+    's3',
+    'snappy',
+    'substrait',
+    'tensorflow',
+    'flight',
+    'slow',
+    'requires_testing_data',
+    'zstd',
+]
+
+defaults = {
+    'brotli': Codec.is_available('brotli'),
+    'bz2': Codec.is_available('bz2'),
+    'cython': False,
+    'dataset': False,
+    'fastparquet': False,
+    'flight': False,
+    'gandiva': False,
+    'gdb': True,
+    'gzip': Codec.is_available('gzip'),
+    'hdfs': False,
+    'hypothesis': False,
+    'large_memory': False,
+    'lz4': Codec.is_available('lz4'),
+    'memory_leak': False,
+    'nopandas': False,
+    'orc': False,
+    'pandas': False,
+    'parquet': False,
+    'parquet_encryption': False,
+    'plasma': False,
+    'requires_testing_data': True,
+    's3': False,
+    'slow': False,
+    'snappy': Codec.is_available('snappy'),
+    'substrait': False,
+    'tensorflow': False,
+    'zstd': Codec.is_available('zstd'),
+}
+
+try:
+    import cython  # noqa
+    defaults['cython'] = True
+except ImportError:
+    pass
+
+try:
+    import fastparquet  # noqa
+    defaults['fastparquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.gandiva  # noqa
+    defaults['gandiva'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pandas  # noqa
+    defaults['pandas'] = True
+except ImportError:
+    defaults['nopandas'] = True
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet_encryption'] = True
+except ImportError:
+    pass
+
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import tensorflow  # noqa
+    defaults['tensorflow'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['s3'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import HadoopFileSystem  # noqa
+    defaults['hdfs'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.substrait  # noqa
+    defaults['substrait'] = True
+except ImportError:
+    pass
+
+
+# Doctest should ignore files for the modules that are not built
+def pytest_ignore_collect(path, config):
+    if config.option.doctestmodules:
+        # don't try to run doctests on the /tests directory
+        if "/pyarrow/tests/" in str(path):
+            return True
+
+        doctest_groups = [
+            'dataset',
+            'orc',
+            'parquet',
+            'plasma',
+            'flight',
+        ]
+
+        # handle cuda, flight, etc
+        for group in doctest_groups:
+            if 'pyarrow/{}'.format(group) in str(path) and \
+               not defaults[group]:
+                return True
+
+        if 'pyarrow/parquet/encryption' in str(path) and \
+           not defaults['parquet_encryption']:
+            return True
+
+        if 'pyarrow/cuda' in str(path):
+            try:
+                import pyarrow.cuda  # noqa
+                return False
+            except ImportError:
+                return True
+
+        if 'pyarrow/fs' in str(path):
+            try:
+                from pyarrow.fs import S3FileSystem  # noqa
+                return False
+            except ImportError:
+                return True
+
+    return False
+
+
+# Save output files from doctest examples into temp dir
+@pytest.fixture(autouse=True)

Review Comment:
   I ran it with and without this fixture on test_arrays.py, and it doesn't make a noticeable difference, so it's probably not worth looking into making this "smarter" with a dynamic scope.





[GitHub] [arrow] github-actions[bot] commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1131837045

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r880100079


##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None, filesystem=None,
 
     Examples
     --------
+    Creating an example pa.Table:

Review Comment:
   ```suggestion
       Creating an example Table:
   ```



##########
python/pyarrow/dataset.py:
##########
@@ -155,46 +155,55 @@ def partitioning(schema=None, field_names=None, flavor=None,
 
     Specify the Schema for paths like "/2009/June":
 
-    >>> partitioning(pa.schema([("year", pa.int16()), ("month", pa.string())]))
+    >>> import pyarrow as pa
+    >>> import pyarrow.dataset as ds
+    >>> ds.partitioning(pa.schema([("year", pa.int16()),
+    ...                            ("month", pa.string())]))
+    <pyarrow._dataset.DirectoryPartitioning object at ...>

Review Comment:
   How useful is it to see those outputs? (We could try to make the repr more informative, though.)
   If we think the output is not useful, we could also do something like `part = ..` so we don't have to add it.
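
   For instance (a sketch of the `part = ..` idea), assigning the result suppresses the repr, so no output line would be needed in the docstring:

   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.dataset as ds
   >>> part = ds.partitioning(pa.schema([("year", pa.int16()),
   ...                                   ("month", pa.string())]))
   ```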



##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None, filesystem=None,
 
     Examples
     --------
+    Creating an example pa.Table:
+
+    >>> import pyarrow as pa
+    >>> import pyarrow.parquet as pq
+    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                              "Brittle stars", "Centipede"]})
+    >>> pq.write_table(table, "file.parquet")
+
     Opening a single file:
 
-    >>> dataset("path/to/file.parquet", format="parquet")
+    >>> import pyarrow.dataset as ds
+    >>> dataset = ds.dataset("file.parquet", format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    year: int64
+    n_legs: int64
+    animal: string
+    ----
+    year: [[2020,2022,2021,2022,2019,2021]]
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
 
     Opening a single file with an explicit schema:
 
-    >>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
+    >>> myschema = pa.schema([
+    ...     ('n_legs', pa.int64()),
+    ...     ('animal', pa.string())])
+    >>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    n_legs: int64
+    animal: string
+    ----
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
 
     Opening a dataset for a single directory:
 
-    >>> dataset("path/to/nyc-taxi/", format="parquet")
-    >>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
+    >>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
+    ...                  partitioning=['year'])
+    >>> dataset = ds.dataset("partitioned_dataset", format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    n_legs: int64
+    animal: string
+    ----
+    n_legs: [[5],[2],[4,100],[2,4]]
+    animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]
+
+    >>> # Single directory from a S3 bucket
+    >>> # dataset("s3://mybucket/nyc-taxi/", format="parquet")
 
     Opening a dataset from a list of relative local paths:
 
-    >>> dataset([
-    ...     "part0/data.parquet",
-    ...     "part1/data.parquet",
-    ...     "part3/data.parquet",
+    >>> dataset = ds.dataset([
+    ...     "partitioned_dataset/2019/part-0.parquet",
+    ...     "partitioned_dataset/2020/part-0.parquet",
+    ...     "partitioned_dataset/2021/part-0.parquet",
     ... ], format='parquet')
-
-    With filesystem provided:
-
-    >>> paths = [
-    ...     'part0/data.parquet',
-    ...     'part1/data.parquet',
-    ...     'part3/data.parquet',
-    ... ]
-    >>> dataset(paths, filesystem='file:///directory/prefix, format='parquet')

Review Comment:
   > There are some examples I removed from ds.dataset that I will add back as a follow-up when I work on the docstring examples for [Filesystems](https://issues.apache.org/jira/browse/ARROW-16091)
   
   I would maybe keep them here for now, but add a `# doctest: +SKIP` on the lines that wouldn't yet work.
   
   (it's certainly fine to only handle them in a follow-up)
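
   A sketch of what that could look like for the removed filesystem example:

   ```python
   >>> paths = [
   ...     'part0/data.parquet',
   ...     'part1/data.parquet',
   ...     'part3/data.parquet',
   ... ]
   >>> dataset(paths, filesystem='file:///directory/prefix',
   ...         format='parquet')  # doctest: +SKIP
   ```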



##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None, filesystem=None,
 
     Examples
     --------
+    Creating an example pa.Table:
+
+    >>> import pyarrow as pa
+    >>> import pyarrow.parquet as pq
+    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                              "Brittle stars", "Centipede"]})
+    >>> pq.write_table(table, "file.parquet")

Review Comment:
   Wondering: do those files end up in the directory you are running pytest from, or where the dataset.py is located?
   
   Because if so, we will probably want to clean them up in some way, so you don't get a bunch of files in your repo from running the doctests.
   (it might be possible to change Python's "current working directory" to a temporary directory in conftest.py?)
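
   A minimal sketch of that idea, assuming pytest's built-in `tmp_path` and `monkeypatch` fixtures:

   ```python
   import pytest

   @pytest.fixture(autouse=True)
   def _docdir(request, tmp_path, monkeypatch):
       # Only change the working directory when running under
       # --doctest-modules, so regular unit tests are unaffected.
       if request.config.option.doctestmodules:
           monkeypatch.chdir(tmp_path)
       yield
   ```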



##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None, filesystem=None,
 
     Examples
     --------
+    Creating an example pa.Table:
+
+    >>> import pyarrow as pa
+    >>> import pyarrow.parquet as pq
+    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                              "Brittle stars", "Centipede"]})
+    >>> pq.write_table(table, "file.parquet")
+
     Opening a single file:
 
-    >>> dataset("path/to/file.parquet", format="parquet")
+    >>> import pyarrow.dataset as ds
+    >>> dataset = ds.dataset("file.parquet", format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    year: int64
+    n_legs: int64
+    animal: string
+    ----
+    year: [[2020,2022,2021,2022,2019,2021]]
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
 
     Opening a single file with an explicit schema:
 
-    >>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
+    >>> myschema = pa.schema([
+    ...     ('n_legs', pa.int64()),
+    ...     ('animal', pa.string())])
+    >>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    n_legs: int64
+    animal: string
+    ----
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
 
     Opening a dataset for a single directory:
 
-    >>> dataset("path/to/nyc-taxi/", format="parquet")
-    >>> dataset("s3://mybucket/nyc-taxi/", format="parquet")

Review Comment:
   For those two cases, it might also be fine to just add `  # doctest: +SKIP` (seeing the output doesn't add much value?)



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2462,26 +2460,26 @@ def read_pandas(self, **kwargs):
 
         Examples
         --------
-        Generate an example dataset:
+        Generate an example parquet file:
 
         >>> import pyarrow as pa
-        >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
-        ...                   'n_legs': [2, 2, 4, 4, 5, 100],
-        ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
-        ...                              "Brittle stars", "Centipede"]})
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+        ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+        ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+        ...                    "Brittle stars", "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
         >>> import pyarrow.parquet as pq
-        >>> pq.write_to_dataset(table, root_path='dataset_v2_read_pandas',
-        ...                     partition_cols=['year'],
-        ...                     use_legacy_dataset=False)
-        >>> dataset = pq._ParquetDatasetV2('dataset_v2_read_pandas/')
+        >>> pq.write_table(table, 'table_V2.parquet')
+        >>> dataset = pq._ParquetDatasetV2('table_V2.parquet')

Review Comment:
   ```suggestion
           >>> dataset = pq.ParquetDataset('table_V2.parquet')
   ```
   
   (we should show the public API)





[GitHub] [arrow] AlenkaF commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r880199343


##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None, filesystem=None,
 
     Examples
     --------
+    Creating an example pa.Table:
+
+    >>> import pyarrow as pa
+    >>> import pyarrow.parquet as pq
+    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                              "Brittle stars", "Centipede"]})
+    >>> pq.write_table(table, "file.parquet")

Review Comment:
   Yes, that is correct. These files pile up in the directory pytest is called from, and they definitely need to be cleaned up. Will try with a temp dir.





[GitHub] [arrow] github-actions[bot] commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1131837028

   https://issues.apache.org/jira/browse/ARROW-16018




[GitHub] [arrow] raulcd commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
raulcd commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1132758842

   Hi @AlenkaF, creating a new job sounds good to me. I have checked out your PR locally, and there are a couple of things that are not clear to me. If I try to run the tests locally with the current setup, `python -m pytest python/pyarrow/tests/`, no tests are collected:
   ```
   $ python -m pytest python/pyarrow/tests/
   ============================================================== test session starts ===============================================================
   platform linux -- Python 3.10.4, pytest-7.1.2, pluggy-1.0.0
   rootdir: /home/raulcd/open_source/arrow/python, configfile: setup.cfg
   plugins: hypothesis-6.46.3, lazy-fixture-0.6.3
   collected 0 items                                                                                                                                
   
   ============================================================= no tests ran in 0.02s ==============================================================
   ```
   This is because the new conftest skips collecting tests if we are using `--doctest-modules`, which is the default setup.
   This is solved if I run the tests using `pytest -r s -v --pyargs pyarrow`, but that is not how we have it documented in the developers guide.
   
   I have also been able to reproduce the same AppVeyor failure by removing the new conftest file; this might require some more investigation:
   ```
   $ mv conftest.py no_conftest.py
   $ pytest -r s -v  --pyargs pyarrow.tests
   ...
   collected 4239 items / 2 errors / 6 skipped                                                                                                      
   
   ===================================================================== ERRORS =====================================================================
   ______________________________________________ ERROR collecting pyarrow/tests/deserialize_buffer.py ______________________________________________
   tests/deserialize_buffer.py:24: in <module>
       with open(sys.argv[1], 'rb') as f:
   E   FileNotFoundError: [Errno 2] No such file or directory: '-r'
   ______________________________________________ ERROR collecting pyarrow/tests/read_record_batch.py _______________________________________________
   tests/read_record_batch.py:26: in <module>
       with open(sys.argv[1], 'rb') as f:
   E   FileNotFoundError: [Errno 2] No such file or directory: '-r'
   ============================================================ short test summary info =============================================================
   SKIPPED [2] tests/test_cuda.py:32: could not import 'pyarrow.cuda': No module named 'pyarrow._cuda'
   SKIPPED [2] tests/test_cuda_numba_interop.py:23: could not import 'pyarrow.cuda': No module named 'pyarrow._cuda'
   SKIPPED [2] tests/test_jvm.py:27: could not import 'jpype': No module named 'jpype'
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ========================================================== 6 skipped, 2 errors in 2.69s ==========================================================
   ```
   I would propose not adding `--doctest-modules` to the default pytest setup; this also solved the AppVeyor issue I experienced locally.
   
   




[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1132775935

   Thanks @raulcd for reviewing and testing locally!
   
   For the first issue: `python -m pytest python/pyarrow/tests/` doesn't collect any tests, which is correct. Skipping all the tests is done in the conftest file on purpose, as this feature should only check the docstring examples in the .py files and not the unit tests. Could you try to run `python -m pytest python/pyarrow/` and let me know how it goes?
   
   For the second issue: the errors that you get when removing the newly added conftest file are due to the fact that doctest collects all the .py files from pyarrow to run doctests on. Because some modules were not built (in your case cuda, for example), doctest complains. For this reason the new conftest file was added: it checks which modules are not installed and tells doctest to skip the files connected to those missing modules.
   
   Removing `--doctest-modules` from the pytest setup solved the issue only because in that case doctest didn't run at all. So we need the conftest file to skip the files connected to the missing modules, and then we should move `--doctest-modules` from the pytest setup to a new job that runs something similar to `pytest --doctest-modules python/pyarrow/`.
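
   In other words, a sketch of the proposed split:

   ```
   # unit tests, default setup (no doctests)
   $ python -m pytest python/pyarrow/tests/

   # docstring examples only, in a dedicated job
   $ pytest --doctest-modules python/pyarrow/
   ```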




[GitHub] [arrow] AlenkaF commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r879014069


##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,104 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+import pytest
+
+
+groups = [
+    'cuda',
+    'dataset',
+    'orc',
+    'parquet',
+    'parquet/encryption',
+    'plasma',
+    'flight',
+    'fs',
+]
+
+defaults = {
+    'cuda': False,
+    'dataset': False,
+    'flight': False,
+    'orc': False,
+    'parquet': False,
+    'parquet/encryption': False,
+    'plasma': False,
+    'fs': False,
+}
+
+try:
+    import pyarrow.cuda # noqa
+    defaults['cuda'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet/encryption'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['fs'] = True

Review Comment:
   I think there is just one (line)
   https://github.com/apache/arrow/blob/d4a7638477769e788030af1e75d55873a92b770b/python/pyarrow/fs.py#L227





[GitHub] [arrow] jorisvandenbossche closed pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)
URL: https://github.com/apache/arrow/pull/13199




[GitHub] [arrow] ursabot commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1137735102

   Benchmark runs are scheduled for baseline = fe2ce209794e810fa939ef5ea0e5b22c6720a725 and contender = 3b92f0279dd69b1120b1623cf1e98f8b559f7762. 3b92f0279dd69b1120b1623cf1e98f8b559f7762 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/acdb87ea0c884e849c99a497f6a2025a...d1b262f73c1248cfad60fac9cd69047d/)
   [Failed :arrow_down:0.19% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e5782939c7c743e193b1b05d6b96dbc8...709bd99ed48543ad9388de764b431233/)
   [Failed :arrow_down:0.37% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/235ae860cdad4d0b98b035303a4c4eae...b38cbb0a04794f9bbfd59d4b5f150a2d/)
   [Finished :arrow_down:0.08% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/72b7c80b9721416bb698e555b162be57...8e8692cb9133483db591bef9b65d3382/)
   Buildkite builds:
   [Finished] [`3b92f027` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/821)
   [Failed] [`3b92f027` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/820)
   [Failed] [`3b92f027` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/810)
   [Finished] [`3b92f027` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/824)
   [Finished] [`fe2ce209` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/820)
   [Failed] [`fe2ce209` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/819)
   [Failed] [`fe2ce209` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/809)
   [Finished] [`fe2ce209` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/823)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1132545732

   There are some examples I removed from `ds.dataset` that I will add back as a follow-up when I work on the docstring examples for [Filesystems](https://issues.apache.org/jira/browse/ARROW-16091), as they include, for example, reading from an S3 bucket, and so I need a bit more time to find a good solution.




[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1134165944

   I will do the CI one today and then we can decide which PR to close first =)




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r881412652


##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pytest
+from pyarrow import Codec
+
+groups = [
+    'brotli',
+    'bz2',
+    'cython',
+    'dataset',
+    'hypothesis',
+    'fastparquet',
+    'gandiva',
+    'gdb',
+    'gzip',
+    'hdfs',
+    'large_memory',
+    'lz4',
+    'memory_leak',
+    'nopandas',
+    'orc',
+    'pandas',
+    'parquet',
+    'parquet_encryption',
+    'plasma',
+    's3',
+    'snappy',
+    'substrait',
+    'tensorflow',
+    'flight',
+    'slow',
+    'requires_testing_data',
+    'zstd',
+]
+
+defaults = {
+    'brotli': Codec.is_available('brotli'),
+    'bz2': Codec.is_available('bz2'),
+    'cython': False,
+    'dataset': False,
+    'fastparquet': False,
+    'flight': False,
+    'gandiva': False,
+    'gdb': True,
+    'gzip': Codec.is_available('gzip'),
+    'hdfs': False,
+    'hypothesis': False,
+    'large_memory': False,
+    'lz4': Codec.is_available('lz4'),
+    'memory_leak': False,
+    'nopandas': False,
+    'orc': False,
+    'pandas': False,
+    'parquet': False,
+    'parquet_encryption': False,
+    'plasma': False,
+    'requires_testing_data': True,
+    's3': False,
+    'slow': False,
+    'snappy': Codec.is_available('snappy'),
+    'substrait': False,
+    'tensorflow': False,
+    'zstd': Codec.is_available('zstd'),
+}
+
+try:
+    import cython  # noqa
+    defaults['cython'] = True
+except ImportError:
+    pass
+
+try:
+    import fastparquet  # noqa
+    defaults['fastparquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.gandiva  # noqa
+    defaults['gandiva'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pandas  # noqa
+    defaults['pandas'] = True
+except ImportError:
+    defaults['nopandas'] = True
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet_encryption'] = True
+except ImportError:
+    pass
+
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import tensorflow  # noqa
+    defaults['tensorflow'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['s3'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import HadoopFileSystem  # noqa
+    defaults['hdfs'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.substrait  # noqa
+    defaults['substrait'] = True
+except ImportError:
+    pass
+
+
+# Doctest should ignore files for the modules that are not built
+def pytest_ignore_collect(path, config):
+    if config.option.doctestmodules:
+        # don't try to run doctests on the /tests directory
+        if "/pyarrow/tests/" in str(path):
+            return True
+
+        doctest_groups = [
+            'dataset',
+            'orc',
+            'parquet',
+            'plasma',
+            'flight',
+        ]
+
+        # handle cuda, flight, etc
+        for group in doctest_groups:
+            if 'pyarrow/{}'.format(group) in str(path) and \
+               not defaults[group]:
+                return True
+
+        if 'pyarrow/parquet/encryption' in str(path) and \
+           not defaults['parquet_encryption']:
+            return True
+
+        if 'pyarrow/cuda' in str(path):
+            try:
+                import pyarrow.cuda  # noqa
+                return False
+            except ImportError:
+                return True
+
+        if 'pyarrow/fs' in str(path):
+            try:
+                from pyarrow.fs import S3FileSystem  # noqa
+                return False
+            except ImportError:
+                return True
+
+    return False
+
+
+# Save output files from doctest examples into temp dir
+@pytest.fixture(autouse=True)

Review Comment:
   For a follow-up, it might be possible to use a dynamic scope to ensure we don't have to run this for every test in `/tests` (in case that adds runtime to our tests). 
   Alternatively, we could also override this fixture in `/tests/conftest.py` to be a no-op fixture, and then all tests in `/tests` would automatically use that version of the fixture.
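
   A minimal sketch of the override (a hypothetical addition to `/tests/conftest.py`); redefining the autouse fixture under the same name shadows the parent one for everything collected below it:

   ```python
   import pytest

   @pytest.fixture(autouse=True)
   def _docdir():
       # no-op: shadows the _docdir fixture from python/pyarrow/conftest.py
       yield
   ```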





[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1132609471

   @raulcd @jorisvandenbossche the code now works for `--doctest-modules` (not sure about the AppVeyor pyarrow test error ...). I would suggest creating, in a separate PR, a new workflow task that only runs the doctests, following what Raul suggested when we talked. There I would remove the `addopts` option for `doctest` from `setup.cfg`.




[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1132841663

   Thanks for helping me understand your comment better, Raul, and sorry for not seeing the problem! Now I also get what Joris was trying to tell me yesterday :)
   
   Yes, the tests are totally being skipped due to this change, and that's not good at all. 
   
   I will remove `--doctest-modules` from the pytest setup and then this PR will be ready, I think. Then I will make a separate one for `--doctest-cython`, similar to this PR, and the last one will be the PR for the CI job. @jorisvandenbossche what do you think?




[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1135513284

   > Further, I think we should still take a look whether we can deduplicate some content of the conftest.py files:
   
   > We should maybe see if it would be possible to de-duplicate the groups/defaults definitions in both conftest.py files (I suppose that if all of them are defined in the top-level conftest.py, that should be fine for the tests as well)
   
   Yes, totally agree; I had it in mind. There are some small differences, but I will try to put the code for the groups/defaults definitions together in the top-level (pyarrow) conftest.py file and leave the rest as is.
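
   A hypothetical sketch of that de-duplication, where the `/tests` conftest would import the shared definitions instead of redefining them:

   ```python
   # in python/pyarrow/tests/conftest.py (sketch; module path assumed)
   from pyarrow.conftest import groups, defaults  # noqa
   ```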
   
   I will also check the other comments and make the changes. Thanks for reviewing!




[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1137034669

   If the checks pass, this should be ready to merge, @jorisvandenbossche.




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r877287548


##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,104 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+import pytest
+
+
+groups = [
+    'cuda',
+    'dataset',
+    'orc',
+    'parquet',
+    'parquet/encryption',
+    'plasma',
+    'flight',
+    'fs',
+]
+
+defaults = {
+    'cuda': False,
+    'dataset': False,
+    'flight': False,
+    'orc': False,
+    'parquet': False,
+    'parquet/encryption': False,
+    'plasma': False,
+    'fs': False,
+}
+
+try:
+    import pyarrow.cuda # noqa
+    defaults['cuda'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet/encryption'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['fs'] = True

Review Comment:
   Are there S3 examples in fs.py?





[GitHub] [arrow] jorisvandenbossche commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1132967685

   > I will remove `--doctest-modules` from the pytest setup and then this PR will be ready, I think. Then I will make a separate one for `--doctest-cython`, similar to this PR, and the last one will be the PR for the CI job. @jorisvandenbossche what do you think?
   
   Yes, that sounds good. Although I think you could maybe do the CI job PR before tackling the cython doctests; that way we can directly test it in CI in the cython PR.




[GitHub] [arrow] AlenkaF commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r879014254


##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,104 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+import pytest
+
+
+groups = [
+    'cuda',
+    'dataset',
+    'orc',
+    'parquet',
+    'parquet/encryption',
+    'plasma',
+    'flight',
+    'fs',
+]
+
+defaults = {
+    'cuda': False,
+    'dataset': False,
+    'flight': False,
+    'orc': False,
+    'parquet': False,
+    'parquet/encryption': False,
+    'plasma': False,
+    'fs': False,
+}
+
+try:
+    import pyarrow.cuda # noqa
+    defaults['cuda'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet/encryption'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['fs'] = True

Review Comment:
   Realised I will have to check these examples also (as they do not get checked locally for me at the moment).





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r881414255


##########
python/pyarrow/dataset.py:
##########
@@ -622,26 +624,75 @@ def dataset(source, schema=None, format=None, filesystem=None,
 
     Examples
     --------
+    Creating an example Table:
+
+    >>> import pyarrow as pa
+    >>> import pyarrow.parquet as pq
+    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                              "Brittle stars", "Centipede"]})
+    >>> pq.write_table(table, "file.parquet")
+
     Opening a single file:
 
-    >>> dataset("path/to/file.parquet", format="parquet")
+    >>> import pyarrow.dataset as ds
+    >>> dataset = ds.dataset("file.parquet", format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    year: int64
+    n_legs: int64
+    animal: string
+    ----
+    year: [[2020,2022,2021,2022,2019,2021]]
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
 
     Opening a single file with an explicit schema:
 
-    >>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
+    >>> myschema = pa.schema([
+    ...     ('n_legs', pa.int64()),
+    ...     ('animal', pa.string())])
+    >>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    n_legs: int64
+    animal: string
+    ----
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
 
     Opening a dataset for a single directory:
 
-    >>> dataset("path/to/nyc-taxi/", format="parquet")
-    >>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
+    >>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
+    ...                  partitioning=['year'])
+    >>> dataset = ds.dataset("partitioned_dataset", format="parquet")
+    >>> dataset.to_table()
+    pyarrow.Table
+    n_legs: int64
+    animal: string
+    ----
+    n_legs: [[5],[2],[4,100],[2,4]]
+    animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]
+
+    For a single directory from a S3 bucket:
+
+    >>> ds.dataset("s3://mybucket/nyc-taxi/", format="parquet")# doctest: +SKIP

Review Comment:
   ```suggestion
       >>> ds.dataset("s3://mybucket/nyc-taxi/", format="parquet")  # doctest: +SKIP
   ```



##########
python/pyarrow/dataset.py:
##########
@@ -155,18 +155,21 @@ def partitioning(schema=None, field_names=None, flavor=None,
 
     Specify the Schema for paths like "/2009/June":
 
-    >>> partitioning(pa.schema([("year", pa.int16()), ("month", pa.string())]))
+    >>> import pyarrow as pa
+    >>> import pyarrow.dataset as ds
+    >>> part = ds.partitioning(pa.schema([("year", pa.int16()),
+    ...                            ("month", pa.string())]))

Review Comment:
   ```suggestion
       >>> part = ds.partitioning(pa.schema([("year", pa.int16()),
       ...                                   ("month", pa.string())]))
   ```



##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pytest
+from pyarrow import Codec
+
+groups = [
+    'brotli',
+    'bz2',
+    'cython',
+    'dataset',
+    'hypothesis',
+    'fastparquet',
+    'gandiva',
+    'gdb',
+    'gzip',
+    'hdfs',
+    'large_memory',
+    'lz4',
+    'memory_leak',
+    'nopandas',
+    'orc',
+    'pandas',
+    'parquet',
+    'parquet_encryption',
+    'plasma',
+    's3',
+    'snappy',
+    'substrait',
+    'tensorflow',
+    'flight',
+    'slow',
+    'requires_testing_data',
+    'zstd',
+]
+
+defaults = {
+    'brotli': Codec.is_available('brotli'),
+    'bz2': Codec.is_available('bz2'),
+    'cython': False,
+    'dataset': False,
+    'fastparquet': False,
+    'flight': False,
+    'gandiva': False,
+    'gdb': True,
+    'gzip': Codec.is_available('gzip'),
+    'hdfs': False,
+    'hypothesis': False,
+    'large_memory': False,
+    'lz4': Codec.is_available('lz4'),
+    'memory_leak': False,
+    'nopandas': False,
+    'orc': False,
+    'pandas': False,
+    'parquet': False,
+    'parquet_encryption': False,
+    'plasma': False,
+    'requires_testing_data': True,
+    's3': False,
+    'slow': False,
+    'snappy': Codec.is_available('snappy'),
+    'substrait': False,
+    'tensorflow': False,
+    'zstd': Codec.is_available('zstd'),
+}
+
+try:
+    import cython  # noqa
+    defaults['cython'] = True
+except ImportError:
+    pass
+
+try:
+    import fastparquet  # noqa
+    defaults['fastparquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.gandiva  # noqa
+    defaults['gandiva'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pandas  # noqa
+    defaults['pandas'] = True
+except ImportError:
+    defaults['nopandas'] = True
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet_encryption'] = True
+except ImportError:
+    pass
+
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import tensorflow  # noqa
+    defaults['tensorflow'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['s3'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import HadoopFileSystem  # noqa
+    defaults['hdfs'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.substrait  # noqa
+    defaults['substrait'] = True
+except ImportError:
+    pass
+
+
+# Doctest should ignore files for the modules that are not built
+def pytest_ignore_collect(path, config):
+    if config.option.doctestmodules:
+        # don't try to run doctests on the /tests directory
+        if "/pyarrow/tests/" in str(path):
+            return True
+
+        doctest_groups = [
+            'dataset',
+            'orc',
+            'parquet',
+            'plasma',
+            'flight',
+        ]
+
+        # handle cuda, flight, etc
+        for group in doctest_groups:
+            if 'pyarrow/{}'.format(group) in str(path) and \
+               not defaults[group]:
+                return True

Review Comment:
   ```suggestion
               if 'pyarrow/{}'.format(group) in str(path):
                   if not defaults[group]:
                       return True
   ```
   
   Possible alternative formatting (which avoids the `\` and the strange indentation)



##########
python/pyarrow/conftest.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pytest
+from pyarrow import Codec
+
+groups = [
+    'brotli',
+    'bz2',
+    'cython',
+    'dataset',
+    'hypothesis',
+    'fastparquet',
+    'gandiva',
+    'gdb',
+    'gzip',
+    'hdfs',
+    'large_memory',
+    'lz4',
+    'memory_leak',
+    'nopandas',
+    'orc',
+    'pandas',
+    'parquet',
+    'parquet_encryption',
+    'plasma',
+    's3',
+    'snappy',
+    'substrait',
+    'tensorflow',
+    'flight',
+    'slow',
+    'requires_testing_data',
+    'zstd',
+]
+
+defaults = {
+    'brotli': Codec.is_available('brotli'),
+    'bz2': Codec.is_available('bz2'),
+    'cython': False,
+    'dataset': False,
+    'fastparquet': False,
+    'flight': False,
+    'gandiva': False,
+    'gdb': True,
+    'gzip': Codec.is_available('gzip'),
+    'hdfs': False,
+    'hypothesis': False,
+    'large_memory': False,
+    'lz4': Codec.is_available('lz4'),
+    'memory_leak': False,
+    'nopandas': False,
+    'orc': False,
+    'pandas': False,
+    'parquet': False,
+    'parquet_encryption': False,
+    'plasma': False,
+    'requires_testing_data': True,
+    's3': False,
+    'slow': False,
+    'snappy': Codec.is_available('snappy'),
+    'substrait': False,
+    'tensorflow': False,
+    'zstd': Codec.is_available('zstd'),
+}
+
+try:
+    import cython  # noqa
+    defaults['cython'] = True
+except ImportError:
+    pass
+
+try:
+    import fastparquet  # noqa
+    defaults['fastparquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.gandiva  # noqa
+    defaults['gandiva'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.dataset  # noqa
+    defaults['dataset'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.orc  # noqa
+    defaults['orc'] = True
+except ImportError:
+    pass
+
+try:
+    import pandas  # noqa
+    defaults['pandas'] = True
+except ImportError:
+    defaults['nopandas'] = True
+
+try:
+    import pyarrow.parquet  # noqa
+    defaults['parquet'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.parquet.encryption  # noqa
+    defaults['parquet_encryption'] = True
+except ImportError:
+    pass
+
+
+try:
+    import pyarrow.plasma  # noqa
+    defaults['plasma'] = True
+except ImportError:
+    pass
+
+try:
+    import tensorflow  # noqa
+    defaults['tensorflow'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.flight  # noqa
+    defaults['flight'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import S3FileSystem  # noqa
+    defaults['s3'] = True
+except ImportError:
+    pass
+
+try:
+    from pyarrow.fs import HadoopFileSystem  # noqa
+    defaults['hdfs'] = True
+except ImportError:
+    pass
+
+try:
+    import pyarrow.substrait  # noqa
+    defaults['substrait'] = True
+except ImportError:
+    pass
+
+
+# Doctest should ignore files for the modules that are not built
+def pytest_ignore_collect(path, config):
+    if config.option.doctestmodules:
+        # don't try to run doctests on the /tests directory
+        if "/pyarrow/tests/" in str(path):
+            return True
+
+        doctest_groups = [
+            'dataset',
+            'orc',
+            'parquet',
+            'plasma',
+            'flight',
+        ]
+
+        # handle cuda, flight, etc
+        for group in doctest_groups:
+            if 'pyarrow/{}'.format(group) in str(path) and \
+               not defaults[group]:
+                return True
+
+        if 'pyarrow/parquet/encryption' in str(path) and \
+           not defaults['parquet_encryption']:
+            return True
+
+        if 'pyarrow/cuda' in str(path):
+            try:
+                import pyarrow.cuda  # noqa
+                return False
+            except ImportError:
+                return True
+
+        if 'pyarrow/fs' in str(path):
+            try:
+                from pyarrow.fs import S3FileSystem  # noqa
+                return False
+            except ImportError:
+                return True
+
+    return False
+
+
+# Save output files from doctest examples into temp dir
+@pytest.fixture(autouse=True)
+def _docdir(request):
+
+    # Trigger ONLY for the doctests.
+    doctest_plugin = request.config.pluginmanager.getplugin("doctest")

Review Comment:
   Would `request.config.option.doctestmodules` work here as well? (to keep it similar to how we checked for it above)





[GitHub] [arrow] AlenkaF commented on pull request #13199: ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13199:
URL: https://github.com/apache/arrow/pull/13199#issuecomment-1137015681

   Thanks! I will address the comments in this PR.
   For the fixture scope follow-up, I will see if I manage to put it into https://github.com/apache/arrow/pull/13216. Otherwise I will create a JIRA for it.

