Posted to dev@arrow.apache.org by Ian Joiner <ia...@gmail.com> on 2021/11/24 03:55:30 UTC

[Doc] ORC-related documentation

Hi,

Today I found that pretty much none of our ORC-related work (e.g. ORC
writer in C++ & Python, Arrow Dataset with ORC) has ever been documented.
This is something we have to fix or users won’t even be aware that ORC
support exists, let alone how to use it.

From my understanding, we are missing the following docs:
1. C++ and Python API reference (partially missing)
2. User Guide (entirely absent)

As the person who created and self-assigned
https://issues.apache.org/jira/browse/ARROW-13231, I’d like to spend the
next couple of days fixing it. Could you guys please point me towards
what actually needs to be revised? In particular, where is the source of
https://arrow.apache.org/docs/python/parquet.html ?
Thanks a lot!

Ian

Re: [Doc] ORC-related documentation

Posted by Ian Joiner <ia...@gmail.com>.
Hi Joris,

Thanks a lot for pointing out where the doc sources are! I will start the
PR and share it with you so that we can work on it together. I can cover
the ORC reader & writer, including their options, and you could cover the
dataset integration that you implemented.

Best,
Ian

On Thursday, November 25, 2021, Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi Ian,
>
> Yes, more documentation regarding ORC would be very welcome! I think
> your list of missing docs is correct:
>
> - It's briefly mentioned in the Python API docs
> (https://arrow.apache.org/docs/python/api/formats.html#orc-files), but
> incomplete
> - The C++ reference docs list the OrcFileFormat for the dataset API
> (https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset13OrcFileFormatE),
> but not the direct ORC interface (like is done for Parquet at
> https://arrow.apache.org/docs/cpp/api/formats.html, for which the
> source lives at
> https://github.com/apache/arrow/blob/master/docs/source/cpp/api/formats.rst)
> - There is indeed no user guide. The Parquet python doc page lives at
> https://github.com/apache/arrow/blob/master/docs/source/python/parquet.rst
>
> Best,
> Joris

Re: [Doc] ORC-related documentation

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Ian,

Yes, more documentation regarding ORC would be very welcome! I think
your list of missing docs is correct:

- It's briefly mentioned in the Python API docs
(https://arrow.apache.org/docs/python/api/formats.html#orc-files), but
incomplete
- The C++ reference docs list the OrcFileFormat for the dataset API
(https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset13OrcFileFormatE),
but not the direct ORC interface (like is done for Parquet at
https://arrow.apache.org/docs/cpp/api/formats.html, for which the
source lives at
https://github.com/apache/arrow/blob/master/docs/source/cpp/api/formats.rst)
- There is indeed no user guide. The Parquet python doc page lives at
https://github.com/apache/arrow/blob/master/docs/source/python/parquet.rst

Best,
Joris
