Posted to dev@arrow.apache.org by Kevin Crouse <kr...@gmail.com> on 2022/04/26 02:08:50 UTC

[Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

Hi everyone,

Sorry if some of this is out of place or not in the right dev email
structure. I've only recently started getting into the arrow dev stuff.

*Summary*: I'm interested in improving the API and function
documentation, especially for the pyarrow compute functions. While doing
some deep implementation work, I've found that some doc examples are wrong
and most functions have no examples at all. I think this is mostly because
the function docs are inherited from the cpp tree, which makes sense as a
way to keep things in sync with the C++ library, the ultimate source.

*Why*: There is almost no way to customize the documentation to reflect the
pythonic features and abilities. It is possible to append a little
additional information by writing a docstring appendix in
python/pyarrow/_compute_docstrings.py, but there is no way to modify or add
to the description or to change the details on the parameters. I feel this
is not ideal because:

   1. it only adds supplemental details,
   2. there's no easy way to test example code as you are writing it,
   3. trying to figure out where any given piece of documentation comes
   from (in order to improve it) requires you to trace your way through a
   lot of modules, and
   4. it feels unnecessarily complex. In order to add an example, we write
   reST-style docstring parts into a python module just so they can be
   reconstituted into a regular function docstring in another module
   (python/pyarrow/compute.py), which is then used to build the actual reST
   docs when the docs are built.

*Proposal?*:
How about having a subdirectory for doc additions written in
reStructuredText that look a lot like regular function docs? This provides
a single, easy-to-find location for the custom python docs (solving #3 and
some of #4), and examples can be tested with doctest (solving #2). Then,
write a function that parses the reST file and merges the details there
with the function docs pulled in from the cpp library in
python/pyarrow/compute.py. This flexibly lets us add examples, notes, or
extra python-specific additions easily (solving #1). And when a parameter
is defined in the reST addition file, its text supplants the text pulled
from the cpp tree; if there's no need to provide extra details, leaving out
the Parameters section just defaults to the current cpp docs.
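
To make the merge rule concrete, here is a minimal sketch using only the
standard library. It is not the prototype and not how pyarrow builds its
docs today. The docstring texts, `split_sections`, `split_params`,
`merge_docs` and `check_examples` are all hypothetical names made up for
illustration, and the example uses the builtin `min` rather than a real
compute function so that it runs without pyarrow. It assumes
numpydoc-style section headers.

```python
import doctest
import re

# Hypothetical stand-in for a docstring generated from the C++ function docs.
CPP_DOC = """\
Compute the minimum of the values.

Parameters
----------
array : Array
    Argument to compute function.
skip_nulls : bool
    Whether to skip nulls.
"""

# Hypothetical Python-specific additions, as they might sit in an override file.
OVERRIDE = """\
Parameters
----------
skip_nulls : bool
    If True (the default), nulls are ignored.

Examples
--------
>>> min([3, 1, 2])
1
"""

SECTION_RE = re.compile(r"^(\w[\w ]*)\n-+\n", re.MULTILINE)
PARAM_RE = re.compile(r"^(\w+) : .*\n?(?:[ \t]+.*\n?)*", re.MULTILINE)


def split_sections(doc):
    # Map section title -> body; the leading summary is stored under "".
    parts = SECTION_RE.split(doc)
    sections = {"": parts[0].strip("\n")}
    for title, body in zip(parts[1::2], parts[2::2]):
        sections[title] = body.strip("\n")
    return sections


def split_params(body):
    # Map parameter name -> its full entry (signature line plus description).
    return {m.group(1): m.group(0).rstrip("\n") for m in PARAM_RE.finditer(body)}


def merge_docs(base, override):
    base_s, over_s = split_sections(base), split_sections(override)
    merged = dict(base_s)
    for title, body in over_s.items():
        if title == "Parameters":
            params = split_params(base_s.get("Parameters", ""))
            params.update(split_params(body))  # override wins, the rest is kept
            merged[title] = "\n".join(params.values())
        elif body:
            merged[title] = body  # new or replaced section
    out = [merged.pop("")]
    for title, body in merged.items():
        out.append(f"{title}\n{'-' * len(title)}\n{body}")
    return "\n\n".join(out) + "\n"


def check_examples(text, name="override"):
    # Run any >>> examples found in the text; return the number of failures.
    test = doctest.DocTestParser().get_doctest(text, {}, name, None, 0)
    return doctest.DocTestRunner(verbose=False).run(test).failed


merged_doc = merge_docs(CPP_DOC, OVERRIDE)
failures = check_examples(OVERRIDE)
```

In this sketch `skip_nulls` picks up the override text, `array` keeps the
text from the base docstring, and the Examples section is appended. The
examples in the override text can be checked on their own, before any
merge happens.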

I realize that may all be hard to follow, especially if you haven't been
deep in the python docs. I quickly threw together a prototype if this
sounds like a useful path forward.

Best,

Kevin

Re: [Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

Posted by Kevin Crouse <kr...@gmail.com>.
Hi Antoine (and all),

Thanks for your thoughts. I'll finish up the prototype and share my
branch. It wouldn't increase the time to import pyarrow by itself, but
the rough idea would increase the import time of pyarrow.compute as it's
currently written (more on that below). The only external library it
imports is docutils, which I would argue is established enough to be
acceptable.

Issue 12526 is interesting: it would eliminate both the docutils
dependency and the time spent parsing the rst. Is there any movement
on that? Also, I saw your comment about including the generated code in the
git repo. What's the benefit of that? For work projects, I explicitly
reject any PRs from my team if they include generated code in the
repository, so that other team members don't inadvertently add
features to something that will be stomped on during the next build.

Regarding the load time for pyarrow.compute: python/pyarrow/compute.py
currently calls `_make_global_functions()` at module level, which generates
all the function docs on module load. If load time is a chief concern, that
already seems like a bad move, as it's rare that the __doc__ info for a
function (let alone for all the functions) is accessed during a given run.
And as the first pass that I'll share is written, it would be slower, since
it naively parses the rst file when building the function wrappers. If
pursued, though, there are a number of ways to get around that: the rst
could be precompiled during the build, and/or the pc functions could
lazy-build their docs only if they are actually requested at runtime. All
of that is moot if Issue 12526 is in the works, though.
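
As a rough illustration of the lazy-build option, here is a sketch of a
wrapper that defers docstring construction until `__doc__` is first read.
It is not how compute.py wraps functions today. `LazyDocFunction`,
`expensive_doc_builder` and `_add` are hypothetical names, and the builder
stands in for whatever would parse the rst.

```python
class LazyDocFunction:
    # Wraps a callable; the docstring is built on first access to __doc__
    # and cached, so nothing is paid for at import time.

    def __init__(self, func, build_doc):
        self._func = func
        self._build_doc = build_doc
        self._doc = None
        self.__name__ = func.__name__

    def __call__(self, *args, **kwargs):
        return self._func(*args, **kwargs)

    @property
    def __doc__(self):
        if self._doc is None:
            self._doc = self._build_doc()
        return self._doc


build_calls = []  # records how many times the docstring was actually built


def expensive_doc_builder():
    # Hypothetical stand-in for parsing an rst file and merging the docs.
    build_calls.append(1)
    return "Add two numbers.\n\nExamples\n--------\n>>> add(1, 2)\n3\n"


def _add(a, b):
    return a + b


add = LazyDocFunction(_add, expensive_doc_builder)
```

Calling `add(1, 2)` never triggers the builder. It runs once, the first
time something such as `help(add)` reads `add.__doc__`. The tradeoff is
that the wrapper is an object with a `__call__` method rather than a plain
function, which some introspection tools may treat differently.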

Best,

Kevin




On Tue, Apr 26, 2022 at 3:29 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi Kevin,
>
> There are a couple of concerns to keep in mind:
> - we don't want to increase the import time of PyArrow too much
> - we would like to limit the required runtime dependencies for PyArrow
>
> (an issue is open to move docstring generation to package build time:
> https://issues.apache.org/jira/browse/ARROW-12526)
>
> As for your proposal, it sounds like an interesting idea but the devil
> may lie in the details, so it would be good to see an actual
> implementation.
>
> Regards
>
> Antoine.

Re: [Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

Posted by Antoine Pitrou <an...@python.org>.
Hi Kevin,

There are a couple of concerns to keep in mind:
- we don't want to increase the import time of PyArrow too much
- we would like to limit the required runtime dependencies for PyArrow

(an issue is open to move docstring generation to package build time:
https://issues.apache.org/jira/browse/ARROW-12526)

As for your proposal, it sounds like an interesting idea but the devil 
may lie in the details, so it would be good to see an actual implementation.

Regards

Antoine.


