Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/02 11:30:32 UTC

[GitHub] [arrow] AlenkaF opened a new pull request #12543: ARROW-15432: [Python] Address CSV docstrings

AlenkaF opened a new pull request #12543:
URL: https://github.com/apache/arrow/pull/12543


   This PR adds docstring examples to:
   - `pyarrow.csv.write_csv`
   - `pyarrow.csv.read_csv`
   - `pyarrow.csv.ReadOptions` except `block_size`
   - `pyarrow.csv.ParseOptions` except `invalid_row_handler`
   - `pyarrow.csv.ConvertOptions`
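
   For context, a minimal round-trip of the two functions might look like the following (a sketch assuming pyarrow is installed; the data is made up for illustration and not taken from the PR):

   ```python
   import io

   import pyarrow as pa
   from pyarrow import csv

   table = pa.table({"animals": ["Flamingo", "Horse"], "n_legs": [2, 4]})

   # Write the table to an in-memory buffer instead of a file on disk.
   buf = io.BytesIO()
   csv.write_csv(table, buf)

   # Read it back; read_csv accepts any file-like object.
   roundtrip = csv.read_csv(io.BytesIO(buf.getvalue()))
   assert roundtrip.equals(table)
   ```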


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1061786598


   @pitrou @jorisvandenbossche I have restructured the examples to be under `read_csv` and `write_csv`. I didn't add an example for every option, since the docstring is already long and I think the existing examples are easy to adapt to the remaining options.
   
   So this should be ready for another round of review.





[GitHub] [arrow] AlenkaF commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1078832859


   I have built the docs to check the HTML version, checked the docstrings in IPython, and also ran doctest on _python/pyarrow/_csv.pyx_ to check the examples.
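
   As a side note, checking docstring examples like this can be done with the standard library's `doctest` module; a minimal stdlib-only sketch (the `add` function here is a stand-in, not pyarrow code):

   ```python
   import doctest


   def add(a, b):
       """Add two numbers.

       >>> add(2, 3)
       5
       """
       return a + b


   # Find and run the doctests attached to `add`, roughly what
   # `python -m doctest some_module.py` does for a whole file.
   finder = doctest.DocTestFinder()
   runner = doctest.DocTestRunner(verbose=False)
   for test in finder.find(add, "add", globs={"add": add}):
       runner.run(test)

   assert runner.failures == 0
   ```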





[GitHub] [arrow] pitrou commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1063864657


   @AlenkaF I fear this makes the docstring overly long and will make displaying it a bit uncomfortable. As a reminder, the docstring can be displayed in its entirety at the interpreter prompt or on generated Sphinx pages (perhaps also in IDEs?).
   
   @jorisvandenbossche What do you think?





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r837273531



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -938,6 +1169,34 @@ def read_csv(input_file, read_options=None, parse_options=None,
     -------
     :class:`pyarrow.Table`
         Contents of the CSV file as an in-memory table.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = "animals,n_legs,entry\\nFlamingo,2,2022-03-01\\nHorse,4,2022-03-02\\nBrittle stars,5,2022-03-03\\nCentipede,100,2022-03-04"
+    >>> print(s)
+    animals,n_legs,entry
+    Flamingo,2,2022-03-01
+    Horse,4,2022-03-02
+    Brittle stars,5,2022-03-03
+    Centipede,100,2022-03-04
+    >>> source = io.BytesIO(s.encode())
+
+    Reading from the file
+
+    >>> from pyarrow import csv
+    >>> csv.read_csv(source)

Review comment:
       Hope this doesn't confuse others also!







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r837254686



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -938,6 +1169,34 @@ def read_csv(input_file, read_options=None, parse_options=None,
     -------
     :class:`pyarrow.Table`
         Contents of the CSV file as an in-memory table.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = "animals,n_legs,entry\\nFlamingo,2,2022-03-01\\nHorse,4,2022-03-02\\nBrittle stars,5,2022-03-03\\nCentipede,100,2022-03-04"
+    >>> print(s)
+    animals,n_legs,entry
+    Flamingo,2,2022-03-01
+    Horse,4,2022-03-02
+    Brittle stars,5,2022-03-03
+    Centipede,100,2022-03-04
+    >>> source = io.BytesIO(s.encode())
+
+    Reading from the file
+
+    >>> from pyarrow import csv
+    >>> csv.read_csv(source)

Review comment:
       Whoops, missed that :) I only looked at the `s` definition







[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r832261089



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       Yes, I agree. I hope I come up with a good idea soon; currently I am not sure what the best approach would be.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r829120331



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column

Review comment:
       ```suggestion
       Define a date parsing format to get a timestamp type column
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])

Review comment:
    A typical example for "strings_can_be_null" is where you have data that uses an empty slot for missing values (e.g. pandas does this by default when writing data):
   
   ```
   In [5]: data = b"a,b\nA,1\n,2\nC,3"
   
   In [6]: print(data.decode())
   a,b
   A,1
   ,2
   C,3
   
   In [9]: csv.read_csv(io.BytesIO(data))
   Out[9]: 
   pyarrow.Table
   a: string
   b: int64
   ----
   a: [["A","","C"]]
   b: [[1,2,3]]
   
   In [10]: csv.read_csv(io.BytesIO(data), convert_options=csv.ConvertOptions(strings_can_be_null=True))
   Out[10]: 
   pyarrow.Table
   a: string
   b: int64
   ----
   a: [["A",null,"C"]]
   b: [[1,2,3]]
   
   
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:

Review comment:
    It makes it longer, but it might be clearer to have two examples here: first only reading a subset of the columns, and then showing that you can also list additional columns and have them included as null-typed columns.
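
   The two-example split suggested above could look roughly like this (a sketch assuming pyarrow; the data is made up to mirror the thread's animal example):

   ```python
   import io

   from pyarrow import csv

   data = b"animals,n_legs\nFlamingo,2\nHorse,4"

   # First example: read only a subset of the columns present in the file.
   opts = csv.ConvertOptions(include_columns=["animals"])
   table = csv.read_csv(io.BytesIO(data), convert_options=opts)
   assert table.column_names == ["animals"]

   # Second example: also list a column that is missing from the file;
   # with include_missing_columns=True it is returned as a null-typed column.
   opts = csv.ConvertOptions(include_columns=["animals", "location"],
                             include_missing_columns=True)
   table = csv.read_csv(io.BytesIO(data), convert_options=opts)
   assert table.column_names == ["animals", "location"]
   assert table.column("location").null_count == 2
   ```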







[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Finished :arrow_down:0.55% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/404| `be45ec60` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/390| `be45ec60` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/390| `be45ec60` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/400| `be45ec60` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/403| `7de798a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/389| `7de798a0` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/389| `7de798a0` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/399| `7de798a0` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r818647828



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -178,6 +183,33 @@ cdef class ReadOptions(_Weakrefable):
         The number of rows to skip before the column names (if any)
         and the CSV data.
         See `skip_rows_after_names` for interaction description
+
+        Examples:
+        ---------
+        >>> from pyarrow import csv

Review comment:
       We were just talking today with Joris about putting all the examples under `csv.read_csv()`, so this will not be repeated.







[GitHub] [arrow] AlenkaF commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1064894467


   @pitrou I shuffled the examples a bit and moved them under the Options classes. They are still quite long, but if users look up the Options docstrings rather than the docstrings of individual Options parameters, this makes sense.





[GitHub] [arrow] jorisvandenbossche commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1070550657


   > I fear this makes the docstring overly long and will make displaying a bit uncomfortable.
   
   Another option would be to move them to the user guide. That of course makes them less accessible from the console or notebook where you check the docstring, but for the online docs that might actually give a nicer experience. 





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r832167636



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])

Review comment:
       Oh great, thanks for clarifying! Will correct the example to show this option.







[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Finished :arrow_down:0.55% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r831889465



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
    Yes, I used `\n` in the example above and the docstring got printed across multiple lines in an interactive session.
   
   Corrected docstrings for `csv.read_csv()`:
   
   ```
       Examples
       --------
   
       Defining an example file from bytes object:
   
       >>> import io
    >>> s = b'animals,n_legs,entry\nFlamingo,2,2022-03-01\nHorse,4,2022-03-02\nBrittle stars,5,2022-03-03\nCentipede,100,2022-03-04'
       >>> print(s.decode())
   
       Reading from the file
   ...
   ```
   
   and docstring print for `csv.read_csv?`:
   
   ```
       Examples
       --------
   
       Defining an example file from bytes object:
   
       >>> import io
       >>> s = b'animals,n_legs,entry
   Flamingo,2,2022-03-01
   Horse,4,2022-03-02
   Brittle stars,5,2022-03-03
Centipede,100,2022-03-04'
       >>> print(s.decode())
   
       Reading from the file
   ...
   ```
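
    The escaping behavior described above can be demonstrated with a tiny stdlib-only sketch (the functions are hypothetical stand-ins): in a regular docstring, `\n` is interpreted by Python, so help output shows a real line break, while `\\n` keeps the two-character escape visible to the reader and to doctest.

    ```python
    def interpreted():
        'Example: >>> s = "a,b\nA,1"'


    def escaped():
        'Example: >>> s = "a,b\\nA,1"'


    # The first docstring contains an actual newline character...
    assert "\n" in interpreted.__doc__
    # ...while the second contains the literal backslash-n sequence.
    assert "\\n" in escaped.__doc__ and "\n" not in escaped.__doc__
    ```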







[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r818658916



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -223,6 +296,34 @@ cdef class ReadOptions(_Weakrefable):
         - `skip_rows` is applied (if non-zero);
        - column names are read (unless `column_names` is set);
         - `skip_rows_after_names` is applied (if non-zero).
+
+        Examples:
+        ---------
+
+        >>> from pyarrow import csv
+
+        >>> read_options = csv.ReadOptions(skip_rows_after_names=1)
+        >>> csv.read_csv("animals.csv", read_options=read_options)
+        pyarrow.Table
+        animals: string
+        n_legs: int64
+        entry: string
+        ----
+        animals: [["Horse","Brittle stars","Centipede"]]
+        n_legs: [[4,5,100]]
+        entry: [["02/03/2022","03/03/2022","04/03/2022"]]

Review comment:
       Yes, true, it works great by default like that. But I needed an example for `timestamp_parsers` =)
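
    For reference, the `skip_rows_after_names` behavior quoted above can be shown with in-memory data instead of the "animals.csv" path (a sketch assuming pyarrow; the data is invented):

    ```python
    import io

    from pyarrow import csv

    data = b"animals,n_legs\nFlamingo,2\nHorse,4\nCentipede,100"

    # Keep the header row as column names, but skip the first data row.
    read_options = csv.ReadOptions(skip_rows_after_names=1)
    table = csv.read_csv(io.BytesIO(data), read_options=read_options)
    assert table.to_pydict() == {"animals": ["Horse", "Centipede"],
                                 "n_legs": [4, 100]}
    ```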







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r819530084



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -178,6 +183,33 @@ cdef class ReadOptions(_Weakrefable):
         The number of rows to skip before the column names (if any)
         and the CSV data.
         See `skip_rows_after_names` for interaction description
+
+        Examples:
+        ---------
+        >>> from pyarrow import csv

Review comment:
       > put all the examples under the `csv.read_csv()` and so this will not be repeating.
   
   All in the `csv.ReadOptions` docstring (and `csv.ConvertOptions` etc), or all under `csv.read_csv` ?







[GitHub] [arrow] ursabot commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1070550657


   > I fear this makes the docstring overly long and will make displaying a bit uncomfortable.
   
   Another option would be to move them to the user guide. That of course makes them less accessible from the console or notebook where you check the docstring, but for the online docs that might actually give a nicer experience. 
   
   EDIT: Alenka and I just chatted about this, and maybe a better option is to improve the user guide a bit to have better pointers to the options, and then keep the examples in the option docstrings as is being done now in the PR here.





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r830895572



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       In this case the docstring with a single-line string is printed correctly interactively (in IPython or Jupyter Notebook at least), but the lines are duplicated by calling `print()`, which makes the whole example longer:
   
   ```python
       Examples
       --------
   
       Defining an example file from bytes object:
   
       >>> import io
       >>> s = b'''1,2,3
   Flamingo,2,2022-03-01
   Horse,4,2022-03-02
   Brittle stars,5,2022-03-03
   Centipede,100,2022-03-04'''
       >>> print(s.decode())
       1,2,3
       Flamingo,2,2022-03-01
       Horse,4,2022-03-02
       Brittle stars,5,2022-03-03
       Centipede,100,2022-03-04
   ```
   
    But if the print is omitted, the HTML version of the example is missing a clear visual of the data. 
   
   I do not have a good idea what I would suggest at this moment, but will think about it and add ideas here.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r832188785



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       Ah, I see (I didn't realize that what you showed was the output from `?`). Hmm, for the raw docstring that indeed gives a duplication. But for the online / HTML rendered docs, it would be useful to include ..







[GitHub] [arrow] AlenkaF commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1075292879


   The failed build is unrelated to this PR.
   I still need to update the user guide (docs/dev/python/csv.html) in my next commit.





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r834222139



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -299,6 +352,48 @@ cdef class ParseOptions(_Weakrefable):
         parsing (because of a mismatching number of columns).
         It should accept a single InvalidRow argument and return either
         "skip" or "error" depending on the desired outcome.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''animals;n_legs;entry
+    ... Flamingo;2;2022-03-01
+    ... # Comment here:
+    ... Horse;4;2022-03-02
+    ... Brittle stars;5;2022-03-03
+    ... Centipede;100;2022-03-04'''
+
+    Read the data from a file skipping rows with comments
+    and defining the delimiter:
+
+    >>> from pyarrow import csv
+
+    >>> class InvalidRowHandler:
+    ...     def __init__(self, result):
+    ...         self.result = result
+    ...     def __call__(self, row):
+    ...         if row.text.startswith("# "):
+    ...             return self.result
+    ...         else:
+    ...             return 'error'
+    ...
+    >>> skip_handler = InvalidRowHandler('skip')

Review comment:
       Yes, that would be better. Will try it out, thanks!
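
       For reference, the suggested simplification might be sketched as a plain function instead of a class (a minimal sketch; the column names and sample data are made up for illustration):

       ```python
       import io
       from pyarrow import csv

       def skip_comment(row):
           # row is a pyarrow.csv.InvalidRow; skip rows that look like
           # comments, otherwise report an error as usual.
           return "skip" if row.text.startswith("#") else "error"

       s = (b"animals,n_legs,entry\n"
            b"Flamingo,2,2022-03-01\n"
            b"# Comment here\n"
            b"Horse,4,2022-03-02\n")
       parse_options = csv.ParseOptions(invalid_row_handler=skip_comment)
       table = csv.read_csv(io.BytesIO(s), parse_options=parse_options)
       ```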







[GitHub] [arrow] github-actions[bot] commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1056842553


   https://issues.apache.org/jira/browse/ARROW-15432





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r819542757



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -178,6 +183,33 @@ cdef class ReadOptions(_Weakrefable):
         The number of rows to skip before the column names (if any)
         and the CSV data.
         See `skip_rows_after_names` for interaction description
+
+        Examples:
+        ---------
+        >>> from pyarrow import csv

Review comment:
       I would suggest adding all of them under `csv.read_csv`, as there is more space available (in the HTML version) and I would be able to link the different options/examples together.







[GitHub] [arrow] AlenkaF commented on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1063933717


   Would separating them by the Options make it better? I think that is fine also.
   I would not give examples per Options parameter though (as done before), but per Option altogether.





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r829124562



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],

Review comment:
       Leaving out this column selection for this example will show that the dict encoding only happens for the string column and not for the numerical one

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo",null,"Brittle stars","Centipede"]]
+
+    Define values to be True and False when converting a column
+    into a bool type:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   false_values=["Flamingo","Horse"],
+    ...                   true_values=["Brittle stars","Centipede"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: bool
+    ----
+    animals: [[false,false,true,true]]
+
+    Change the type of a column:

Review comment:
       Suggestion to move this example towards the beginning of this set of examples, as I think specifying the column type might be one of the most typical things to do.
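
       A column type override along these lines might look like this (a minimal sketch; the column names and data are illustrative, not taken from the PR):

       ```python
       import io
       import pyarrow as pa
       from pyarrow import csv

       s = b"animals,n_legs\nFlamingo,2\nCentipede,100\n"
       # Override the inferred int64 type for the "n_legs" column.
       convert_options = csv.ConvertOptions(
           column_types={"n_legs": pa.float64()})
       table = csv.read_csv(io.BytesIO(s), convert_options=convert_options)
       ```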

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3

Review comment:
       same here







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r828971202



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -299,6 +352,48 @@ cdef class ParseOptions(_Weakrefable):
         parsing (because of a mismatching number of columns).
         It should accept a single InvalidRow argument and return either
         "skip" or "error" depending on the desired outcome.
+
+    Example:
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1;2;3

Review comment:
       For this example I would use the descriptive column names (since you don't need to show providing the names manually)

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------

Review comment:
       ```suggestion
       Examples
       --------
   ```
   
   (and the same for the other docstrings, see https://numpydoc.readthedocs.io/en/latest/format.html#examples)







[GitHub] [arrow] jorisvandenbossche closed pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #12543:
URL: https://github.com/apache/arrow/pull/12543


   





[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Failed :arrow_down:1.08% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Finished :arrow_down:0.55% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/404| `be45ec60` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/390| `be45ec60` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/390| `be45ec60` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/400| `be45ec60` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/403| `7de798a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/389| `7de798a0` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/389| `7de798a0` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/399| `7de798a0` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r828967518



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       The annoying thing here with the `...`, while correct, is that if you copy the example to run it yourself, it won't work. I don't know a good solution though ...

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])

Review comment:
       This seems like bad practice to mix values with month-first vs day-first in a single column, so maybe this is not the best example to show (or maybe use one with a different delimiter instead, like `["%m/%d/%Y", "%m-%d-%Y"]`)
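
       The suggested variant might be sketched like this (a minimal sketch; all values use the first format, and the second parser with a different delimiter is listed only as a fallback):

       ```python
       import io
       from pyarrow import csv

       s = b"animals,entry\nFlamingo,03/01/2022\nHorse,03/02/2022\n"
       # Parsers are tried in order; here "%m-%d-%Y" is just a fallback.
       convert_options = csv.ConvertOptions(
           timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
       table = csv.read_csv(io.BytesIO(s), convert_options=convert_options)
       ```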

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo",null,"Brittle stars","Centipede"]]
+
+    Define values to be True and False when converting a column
+    into a bool type:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   false_values=["Flamingo","Horse"],
+    ...                   true_values=["Brittle stars","Centipede"])

Review comment:
       We could also decide to use a few different variants of the example csv data, which could make the example more realistic. For example, you could have a columns with "F" and "N" values, or "Yes" and "No" or something like that.
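
       Such a variant could be sketched roughly like this (illustrative data; the "can_fly" column with Yes/No values is made up for the example):

       ```python
       import io
       from pyarrow import csv

       s = b"animals,can_fly\nFlamingo,Yes\nHorse,No\n"
       # With matching true/false value sets, the column converts to bool.
       convert_options = csv.ConvertOptions(
           true_values=["Yes"], false_values=["No"])
       table = csv.read_csv(io.BytesIO(s), convert_options=convert_options)
       ```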

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       Checking the pandas guide, where we seem to solve this by defining it as a single-line string with manual `\n` in it for line breaks, but then also print it first to be able to see what it looks like. For example see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#specifying-column-data-types
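
       The pandas-style approach mentioned above might look like this in the pyarrow docstrings (a sketch; the data values are illustrative):

       ```python
       import io
       from pyarrow import csv

       # Single-line bytes literal with explicit "\n" line breaks, so the
       # snippet can be copy-pasted and run as-is; print it to show layout.
       s = b"animals,n_legs,entry\nFlamingo,2,2022-03-01\nHorse,4,2022-03-02\n"
       print(s.decode())
       table = csv.read_csv(io.BytesIO(s))
       ```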







[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Finished :arrow_down:0.55% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/404| `be45ec60` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/390| `be45ec60` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/400| `be45ec60` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/403| `7de798a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/389| `7de798a0` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/399| `7de798a0` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r837252511



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -938,6 +1169,34 @@ def read_csv(input_file, read_options=None, parse_options=None,
     -------
     :class:`pyarrow.Table`
         Contents of the CSV file as a in-memory table.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = "animals,n_legs,entry\\nFlamingo,2,2022-03-01\\nHorse,4,2022-03-02\\nBrittle stars,5,2022-03-03\\nCentipede,100,2022-03-04"
+    >>> print(s)
+    animals,n_legs,entry
+    Flamingo,2,2022-03-01
+    Horse,4,2022-03-02
+    Brittle stars,5,2022-03-03
+    Centipede,100,2022-03-04
+    >>> source = io.BytesIO(s.encode())
+
+    Reading from the file
+
+    >>> from pyarrow import csv
+    >>> csv.read_csv(source)

Review comment:
       Line 1186 ⬆️ 







[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r818656432



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -328,6 +429,30 @@ cdef class ParseOptions(_Weakrefable):
     def delimiter(self):
         """
         The character delimiting individual cells in the CSV data.
+
+        Examples:
+        ---------
+
+        >>> from pyarrow import csv
+
+        >>> parse_options = csv.ParseOptions(delimiter=";")
+        >>> csv.read_csv("animals.csv", parse_options=parse_options)
+        pyarrow.Table
+        animals,"n_legs","entry": string
+        ----
+        animals,"n_legs","entry": [["Flamingo,2,"01/03/2022"","Horse,4,"02/03/2022"",
+        "Brittle stars,5,"03/03/2022"","Centipede,100,"04/03/2022""]]

Review comment:
       One option would be to create different `.csv` files so data could be parsed correctly but I can't do that with `csv.write_csv`.
   
   I agree, it doesn't look nice. It does show what it does, but I should find a better way to do this.
   
   Maybe it would be better to include only the examples that work and don't confuse, and leave out the others. Once the user sees how the options are used, there is no need to list them all anyway. Especially if they will be listed under one docstring.







[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Finished :arrow_down:0.55% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/404| `be45ec60` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/390| `be45ec60` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/400| `be45ec60` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/403| `7de798a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/389| `7de798a0` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/389| `7de798a0` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/399| `7de798a0` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r837247576



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -938,6 +1169,34 @@ def read_csv(input_file, read_options=None, parse_options=None,
     -------
     :class:`pyarrow.Table`
         Contents of the CSV file as a in-memory table.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = "animals,n_legs,entry\\nFlamingo,2,2022-03-01\\nHorse,4,2022-03-02\\nBrittle stars,5,2022-03-03\\nCentipede,100,2022-03-04"
+    >>> print(s)
+    animals,n_legs,entry
+    Flamingo,2,2022-03-01
+    Horse,4,2022-03-02
+    Brittle stars,5,2022-03-03
+    Centipede,100,2022-03-04
+    >>> source = io.BytesIO(s.encode())
+
+    Reading from the file
+
+    >>> from pyarrow import csv
+    >>> csv.read_csv(source)

Review comment:
       `source` doesn't exist here?







[GitHub] [arrow] ursabot edited a comment on pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#issuecomment-1081644163


   Benchmark runs are scheduled for baseline = 7de798a0bb120920553f1bef3b05dfc6637c0f7a and contender = be45ec60ad707d1c9139bc910ab84435248aa61f. be45ec60ad707d1c9139bc910ab84435248aa61f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/180cb19bb44042099d38d0ec832ec0e1...4200ff1d85464e2ba89ebbd876cc5ce6/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/730d2af0fa8f41739bade855b5a5682f...94f21f73813043a6ab515f26b82ec48e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6e1dd9efb43a41f4aea4aae55264a7ad...d2cdd520086d4b6581d1e9179d6c6785/)
   [Finished :arrow_down:0.55% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/edc83b3a4fb24c8dad06d1b3c5f5d192...c8fa59792c7240f793f2496c0225df5e/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] pitrou commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r818633656



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -178,6 +183,33 @@ cdef class ReadOptions(_Weakrefable):
         The number of rows to skip before the column names (if any)
         and the CSV data.
         See `skip_rows_after_names` for interaction description
+
+        Examples:
+        ---------
+        >>> from pyarrow import csv

Review comment:
       I'm not sure it's worth repeating the import at the top of each example. But otherwise you should add one at the top of the `use_threads` example.
   
   @jorisvandenbossche Thoughts?

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -340,6 +465,22 @@ cdef class ParseOptions(_Weakrefable):
         """
         The character used optionally for quoting CSV values
         (False if quoting is not allowed).
+
+        Examples:
+        ---------
+
+        >>> from pyarrow import csv
+
+        >>> parse_options = csv.ParseOptions(quote_char=",")
+        >>> csv.read_csv("animals.csv", parse_options=parse_options)
+        pyarrow.Table
+        "animals": string
+        "n_legs": int64
+        "entry": string
+        ----
+        "animals": [[""Flamingo"",""Horse"",""Brittle stars"",""Centipede""]]

Review comment:
       Similar question here, and the result will probably confuse the user.

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -223,6 +296,34 @@ cdef class ReadOptions(_Weakrefable):
         - `skip_rows` is applied (if non-zero);
         - column names aread (unless `column_names` is set);
         - `skip_rows_after_names` is applied (if non-zero).
+
+        Examples:
+        ---------
+
+        >>> from pyarrow import csv
+
+        >>> read_options = csv.ReadOptions(skip_rows_after_names=1)
+        >>> csv.read_csv("animals.csv", read_options=read_options)
+        pyarrow.Table
+        animals: string
+        n_legs: int64
+        entry: string
+        ----
+        animals: [["Horse","Brittle stars","Centipede"]]
+        n_legs: [[4,5,100]]
+        entry: [["02/03/2022","03/03/2022","04/03/2022"]]

Review comment:
       Sidenote: if the dates were in ISO format (e.g. "2022-03-02"), they would be inferred neatly as date32.
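
   A quick sketch of that inference (sample data invented for illustration):

   ```python
   # With ISO-8601 dates, pyarrow's CSV reader infers a date32 column
   # without needing ConvertOptions.timestamp_parsers at all.
   import io
   import pyarrow as pa
   from pyarrow import csv

   data = b"animals,entry\nFlamingo,2022-03-01\nHorse,2022-03-02"
   table = csv.read_csv(io.BytesIO(data))
   print(table.schema.field("entry").type)  # date32[day]
   ```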

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -328,6 +429,30 @@ cdef class ParseOptions(_Weakrefable):
     def delimiter(self):
         """
         The character delimiting individual cells in the CSV data.
+
+        Examples:
+        ---------
+
+        >>> from pyarrow import csv
+
+        >>> parse_options = csv.ParseOptions(delimiter=";")
+        >>> csv.read_csv("animals.csv", parse_options=parse_options)
+        pyarrow.Table
+        animals,"n_legs","entry": string
+        ----
+        animals,"n_legs","entry": [["Flamingo,2,"01/03/2022"","Horse,4,"02/03/2022"",
+        "Brittle stars,5,"03/03/2022"","Centipede,100,"04/03/2022""]]

Review comment:
       I don't know... is it useful to show a CSV file being parsed with the wrong delimiter?
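
   For contrast, a sketch where the delimiter actually matches the data (sample data invented for illustration):

   ```python
   # When the file really is semicolon-delimited, ParseOptions(delimiter=";")
   # splits it into the expected columns instead of one merged column.
   import io
   from pyarrow import csv

   data = b"animals;n_legs\nFlamingo;2\nHorse;4"
   parse_options = csv.ParseOptions(delimiter=";")
   table = csv.read_csv(io.BytesIO(data), parse_options=parse_options)
   print(table.column_names)  # ['animals', 'n_legs']
   ```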

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -190,6 +222,21 @@ cdef class ReadOptions(_Weakrefable):
         """
         The column names of the target table.  If empty, fall back on
         `autogenerate_column_names`.
+
+        Examples:
+        ---------
+        >>> from pyarrow import csv
+
+        >>> >>> read_options = csv.ReadOptions(column_names=["a", "n", "d"])

Review comment:
       ```suggestion
        >>> read_options = csv.ReadOptions(column_names=["a", "n", "d"])
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -152,6 +152,11 @@ cdef class ReadOptions(_Weakrefable):
     def use_threads(self):
         """
         Whether to use multiple threads to accelerate reading.
+
+        Examples:
+        ---------

Review comment:
       I don't think numpydoc expects a trailing colon:
   ```suggestion
           Examples
           --------
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -359,6 +500,11 @@ cdef class ParseOptions(_Weakrefable):
         """
         Whether two quotes in a quoted CSV value denote a single quote
         in the data.
+
+        Examples:
+        ---------
+        >>> parse_options = csv.ParseOptions(double_quote=False)
+        >>> csv.read_csv(input_file, parse_options=parse_options)

Review comment:
       I think we don't necessarily have to add an example if the example doesn't show anything interesting :-)







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r828971202



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -299,6 +352,48 @@ cdef class ParseOptions(_Weakrefable):
         parsing (because of a mismatching number of columns).
         It should accept a single InvalidRow argument and return either
         "skip" or "error" depending on the desired outcome.
+
+    Example:
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1;2;3

Review comment:
       For this example I would use the descriptive column names (since you don't need to demonstrate providing the names manually)

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------

Review comment:
       ```suggestion
       Examples
       --------
   ```
   
   (and the same for the other docstrings, see https://numpydoc.readthedocs.io/en/latest/format.html#examples)

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column

Review comment:
       ```suggestion
       Define a date parsing format to get a timestamp type column
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set upper limit for the number of categories. If the categories
+    is more than the limit, the conversion to dictionary will not
+    happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])

Review comment:
       A typical example for "strings_can_be_null" is data that uses an empty slot for missing values (e.g. pandas does this by default when writing data):
   
   ```
   In [5]: data = b"a,b\nA,1\n,2\nC,3"
   
   In [6]: print(data.decode())
   a,b
   A,1
   ,2
   C,3
   
   In [9]: csv.read_csv(io.BytesIO(data))
   Out[9]: 
   pyarrow.Table
   a: string
   b: int64
   ----
   a: [["A","","C"]]
   b: [[1,2,3]]
   
   In [10]: csv.read_csv(io.BytesIO(data), convert_options=csv.ConvertOptions(strings_can_be_null=True))
   Out[10]: 
   pyarrow.Table
   a: string
   b: int64
   ----
   a: [["A",null,"C"]]
   b: [[1,2,3]]
   
   
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:

Review comment:
       It makes it longer, but it might be clearer to have two examples here: first read only a subset of columns, and then show that you can also list extra columns and get them included as null-typed columns.

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],

Review comment:
       If we leave out this column selection for this example, it will show that the dict encoding only happens for the string column and not for the numerical column

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set upper limit for the number of categories. If the categories
+    is more than the limit, the conversion to dictionary will not
+    happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo",null,"Brittle stars","Centipede"]]
+
+    Define values to be True and False when converting a column
+    into a bool type:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   false_values=["Flamingo","Horse"],
+    ...                   true_values=["Brittle stars","Centipede"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: bool
+    ----
+    animals: [[false,false,true,true]]
+
+    Change the type of a column:

Review comment:
       Suggestion to move this example towards the beginning of this set of examples, as I think specifying the column type might be one of the most typical things to do.

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3

Review comment:
       same here

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
    The annoying thing with the `...` continuation markers is that, while correct, they mean that if you copy the example to run it yourself, it won't work. I don't know a good solution though ...

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])

Review comment:
    Mixing month-first and day-first values in a single column seems like bad practice, so maybe this is not the best example to show (or maybe use formats with a different delimiter instead, like `["%m/%d/%Y", "%m-%d-%Y"]`)
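
    A stdlib-only sketch of why that mix is ambiguous (not part of the PR): the same string parses successfully under both formats, so the result depends entirely on which parser is tried first.

    ```python
    from datetime import datetime

    value = "01/03/2022"

    # Both formats accept this string, so the interpretation depends on
    # parser order: January 3rd vs. March 1st.
    month_first = datetime.strptime(value, "%m/%d/%Y")
    day_first = datetime.strptime(value, "%d/%m/%Y")

    print(month_first.date())  # 2022-01-03
    print(day_first.date())    # 2022-03-01
    ```

    Using formats that differ in their delimiter (e.g. `/` vs `-`) avoids this, since at most one parser can match a given string.
    
    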

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null = True,
+    ...                                      null_values=["Horse"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo",null,"Brittle stars","Centipede"]]
+
+    Define values to be True and False when converting a column
+    into a bool type:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   false_values=["Flamingo","Horse"],
+    ...                   true_values=["Brittle stars","Centipede"])

Review comment:
    We could also decide to use a few different variants of the example csv data, which could make the example more realistic. For example, you could have a column with "F" and "N" values, or "Yes" and "No", or something like that.

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
    Checking the pandas guide, we seem to solve this by defining the data as a single-line string with manual `\n` line breaks, and then also printing it first to show what it looks like. For example see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#specifying-column-data-types
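
    A stdlib-only sketch of that pattern (hypothetical sample data; the `csv.read_csv` call itself is left as a comment, since running it requires pyarrow):

    ```python
    import io

    # Sample data as a one-liner with explicit "\n" separators, so the
    # snippet can be copy-pasted into a REPL and run as-is.
    s = b"animals,n_legs,entry\nFlamingo,2,2022-03-01\nHorse,4,2022-03-02"

    # Print it first so the reader can see what the "file" looks like.
    print(s.decode())

    # The buffer can then be passed to csv.read_csv(...) as in the docstring:
    # table = csv.read_csv(io.BytesIO(s))
    buf = io.BytesIO(s)
    print(buf.read().decode().splitlines()[0])  # animals,n_legs,entry
    ```
    
    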




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r834214328



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,143 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''animals,n_legs,entry,fast
+    ... Flamingo,2,01/03/2022,Yes
+    ... Horse,4,02/03/2022,Yes
+    ... Brittle stars,5,03/03/2022,No
+    ... Centipede,100,04/03/2022,No
+    ... ,6,05/03/2022,'''
+
+    Change the type of a column:
+
+    >>> import pyarrow as pa
+    >>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: double
+    entry: string
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Specify a subset of columns to be read:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+
+    List an additional column to be included as a null-typed column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    location: [5 nulls]
+
+    Define a column as a dictionary:

Review comment:
       ```suggestion
       Define a column as a dictionary type:
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,143 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''animals,n_legs,entry,fast
+    ... Flamingo,2,01/03/2022,Yes
+    ... Horse,4,02/03/2022,Yes
+    ... Brittle stars,5,03/03/2022,No
+    ... Centipede,100,04/03/2022,No
+    ... ,6,05/03/2022,'''
+
+    Change the type of a column:
+
+    >>> import pyarrow as pa
+    >>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: double
+    entry: string
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Specify a subset of columns to be read:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+
+    List an additional column to be included as a null-typed column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    location: [5 nulls]
+
+    Define a column as a dictionary:

Review comment:
       Maybe also mention that by default only the string/binary columns are dictionary encoded?

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -299,6 +352,48 @@ cdef class ParseOptions(_Weakrefable):
         parsing (because of a mismatching number of columns).
         It should accept a single InvalidRow argument and return either
         "skip" or "error" depending on the desired outcome.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''animals;n_legs;entry
+    ... Flamingo;2;2022-03-01
+    ... # Comment here:
+    ... Horse;4;2022-03-02
+    ... Brittle stars;5;2022-03-03
+    ... Centipede;100;2022-03-04'''
+
+    Read the data from a file skipping rows with comments
+    and defining the delimiter:
+
+    >>> from pyarrow import csv
+
+    >>> class InvalidRowHandler:
+    ...     def __init__(self, result):
+    ...         self.result = result
+    ...     def __call__(self, row):
+    ...         if row.text.startswith("# "):
+    ...             return self.result
+    ...         else:
+    ...             return 'error'
+    ...
+    >>> skip_handler = InvalidRowHandler('skip')

Review comment:
       We used such a class in the tests (to make it more flexible, and to test what was passed), but for this example here, it could maybe also be a simpler function? Something like:
   
   ```
   def skip_comment(row):
       if row.text.startswith("# "):
           return "skip"
       return "error"
   ```
   
   (didn't test if it works)
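
    A quick stdlib-only check of the suggested function (using `types.SimpleNamespace` as a stand-in for pyarrow's `InvalidRow`, which exposes the raw line as `.text` — the real handler would be passed via `ParseOptions(invalid_row_handler=skip_comment)`):

    ```python
    from types import SimpleNamespace

    def skip_comment(row):
        # `row` is expected to expose the offending line as `row.text`,
        # as pyarrow's InvalidRow does.
        if row.text.startswith("# "):
            return "skip"
        return "error"

    # Stand-in rows so the logic can be exercised without pyarrow installed.
    comment_row = SimpleNamespace(text="# Comment here:")
    data_row = SimpleNamespace(text="Horse;4;2022-03-02")

    print(skip_comment(comment_row))  # skip
    print(skip_comment(data_row))     # error
    ```
    
    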

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,143 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''animals,n_legs,entry,fast
+    ... Flamingo,2,01/03/2022,Yes
+    ... Horse,4,02/03/2022,Yes
+    ... Brittle stars,5,03/03/2022,No
+    ... Centipede,100,04/03/2022,No
+    ... ,6,05/03/2022,'''
+
+    Change the type of a column:
+
+    >>> import pyarrow as pa
+    >>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: double
+    entry: string
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Specify a subset of columns to be read:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+
+    List an additional column to be included as a null-typed column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    location: [5 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    n_legs: int64
+    entry: timestamp[s]
+    fast: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede",""]  -- indices:
+    [0,1,2,3,4]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [  -- dictionary:
+    ["Yes","No",""]  -- indices:
+    [0,0,1,1,2]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+
+    Set empty strings to missing values:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals", "n_legs"],
+    ...                   strings_can_be_null = True)

Review comment:
       ```suggestion
       ...                   strings_can_be_null=True)
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       Idea from our call: use `\\n` if that also renders fine in the html docs?







[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r830765507



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------

Review comment:
       Oh this is great, now I can see the change in the Sphinx docs (html version) also. Thank you!







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r831164514



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       I think if you do include the "print", I would use explicit newlines (`\n`) in `s`, so that this becomes a one-liner (and avoids making the example a lot longer). 
   Or, if you keep the multi-line string as in your example above, I think you can also leave out the "print" call.



