You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/26 11:28:37 UTC

[GitHub] [arrow] joosthooz opened a new pull request, #13709: Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

joosthooz opened a new pull request, #13709:
URL: https://github.com/apache/arrow/pull/13709

   WIP Adding an optional function that wraps all input streams with a user-supplied transcoding function.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1199314506

   Thank you for the comments and suggestions. 
   I implemented Weston's suggestion and added an `encoding` field to `CsvFileFormat` instead of the added parameter to `dataset()`. This works, but I dislike it a lot. The field is a duplicate to the same one in `ReadOptions`. So when using `read_csv`, users need to use the field in `read_options`, but when using `dataset`, they need to use the field in `format`. They can also use both of them, 1 is silently discarded. Here's what this looks like:
   ```
   >>> fo = ds.CsvFileFormat(default_fragment_scan_options=ds.CsvFragmentScanOptions(read_options=csv.ReadOptions(encoding='iso-8259')), encoding='cp1252')
   >>> fo.default_fragment_scan_options.read_options.encoding
   'utf8'
   >>> fo.encoding
   'cp1252'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1198576659

   The current state is that it works, but it relies on the workaround of adding an `encoding` parameter to `pyarrow.dataset()`.
   That needs to be dealt with before proceeding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1201278866

   ([4d819aa](https://github.com/apache/arrow/pull/13709/commits/4d819aa9e01520911b250a6159e8f1815dc47839) should be `Removed encoding from CsvFragmentScanOptions.equals()`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r932617253


##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):

Review Comment:
   These duplicates have been removed, once I figured out how to move all relevant parts into `io.pxi`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1199493960

   I pushed an alternative way of passing the encoding in https://github.com/apache/arrow/pull/13709/commits/22eff73806c8511288c16d87cbf64da26a7d941e. For the user it works the same way as in `read_csv`: it is a field in `read_options`. I store the value in `CsvFragmentScanOptions`, and then restore it.
   How do you feel about this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1209088705

   I would favor #13820, which pushes complexity into Python, over this one, which introduces a dummy option in C++ that has no effect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #13709: Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r931528356


##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):
+    """
+    Construct an Arrow buffer from a Python bytes-like or buffer-like object
+
+    Parameters
+    ----------
+    obj : object
+        the object from which the buffer should be constructed.
+    """
+    cdef shared_ptr[CBuffer] buf
+    buf = GetResultValue(PyBuffer.FromPyObject(obj))
+    return pyarrow_wrap_buffer(buf)
+
+
+# From io.pxi
+cdef void _cb_transform(transform_func, const shared_ptr[CBuffer]& src,
+                        shared_ptr[CBuffer]* dest) except *:
+    py_dest = transform_func(pyarrow_wrap_buffer(src))
+    dest[0] = pyarrow_unwrap_buffer(py_buffer(py_dest))
+
+
+# from io.pxi
+class Transcoder:

Review Comment:
   We mostly discussed this in the JIRA - you'd have to pull in a library like icu if you want to do it on the C++ side, and also Python (at least) has 'special' encodings like 'unicodereplace' that users may or may not expect to be able to use



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1205010971

   > Somewhere in the CSV reader itself we should also validate the option
   
   I tried this, but it doesn't work, because in that case we would need to re-set the field back to `utf8` when adding a transcoder in python. Otherwise, the error is triggered even though we are transcoding to utf8. But then, we will again run into the issue where the `ReadOptions` object that the user created is changed:
   ```
   >>> import pyarrow.dataset as ds
   >>> import pyarrow.csv as csv
   >>> ro =csv.ReadOptions(encoding='iso8859')
   >>> fo = ds.CsvFileFormat(read_options=ro)
   >>> dataset = ds.dataset("file.csv", format=fo)
   >>> ro.encoding
   'utf8'
   ```
   This would be really strange if you ask me. And if we accept this strange behavior, we didn't need to add the `encoding` field in the first place.
   So now, the field is basically ignored in the CSV reader, there only is the check in the dataset CSV reader that there must be a wrapping function set if the encoding is not utf8.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r940149496


##########
python/pyarrow/dataset.py:
##########
@@ -433,6 +433,10 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
+    if isinstance(format, CsvFileFormat):

Review Comment:
   I think that makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13709: Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1195453860

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r932617687


##########
cpp/src/arrow/python/io.h:
##########
@@ -112,5 +112,8 @@ std::shared_ptr<::arrow::io::InputStream> MakeTransformInputStream(
     std::shared_ptr<::arrow::io::InputStream> wrapped, TransformInputStreamVTable vtable,
     PyObject* arg);
 
+std::function<Result<std::shared_ptr<io::InputStream>>(std::shared_ptr<io::InputStream>)>
+makeStreamTransformFunc(TransformInputStreamVTable vtable, PyObject* handler);

Review Comment:
   Done, thanks for pointing this out!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1198568694

   https://issues.apache.org/jira/browse/ARROW-16000


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1201283801

   Ok I pushed something completely different. I added encoding as a field in the C struct and some wrapper code that tries to `dlopen` the `libiconv` library. I haven't really tested it beyond seeing that it doesn't crash when I read some data from a dataset. Now the question is how do we let the user specify what they want to do? As in choose between a Python transcoder or a library on his system. And how do we show what libraries we have available? Should we create an example about how peopple can add their own wrappers?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r936052670


##########
cpp/src/arrow/csv/options.h:
##########
@@ -163,6 +163,9 @@ struct ARROW_EXPORT ReadOptions {
   /// If false, column names will be read from the first CSV row after `skip_rows`.
   bool autogenerate_column_names = false;
 
+  /// Character encoding used
+  std::string encoding = "UTF-8";

Review Comment:
   ```suggestion
     /// Character encoding used. Only "UTF-8" is supported.
     std::string encoding = "UTF-8";
   ```



##########
python/pyarrow/io.pxi:
##########
@@ -1547,6 +1547,33 @@ class Transcoder:
         return self._encoder.encode(self._decoder.decode(buf, final), final)
 
 
+cdef shared_ptr[function[StreamWrapFunc]] make_streamwrap_func(
+        src_encoding, dest_encoding) except *:
+    """
+    Create a function that will add a transcoding transformation to a stream.
+    Data from that stream will be decoded according to ``src_encoding`` and
+    then re-encoded according to ``dest_encoding``.
+    The created function can be used to wrap streams.
+
+    Parameters
+    ----------
+    src_encoding : str
+        The codec to use when reading data.
+    dest_encoding : str
+        The codec to use for emitted data.
+    """
+    cdef:
+        shared_ptr[function[StreamWrapFunc]] empty_func
+        CTransformInputStreamVTable vtable
+
+    vtable.transform = _cb_transform
+    src_codec = codecs.lookup(src_encoding)
+    dest_codec = codecs.lookup(dest_encoding)

Review Comment:
   Don't we want to skip this when src == dest?



##########
cpp/src/arrow/dataset/file_csv.cc:
##########
@@ -183,9 +183,19 @@ static inline Future<std::shared_ptr<csv::StreamingReader>> OpenReaderAsync(
   auto tracer = arrow::internal::tracing::GetTracer();
   auto span = tracer->StartSpan("arrow::dataset::CsvFileFormat::OpenReaderAsync");
 #endif
+  ARROW_ASSIGN_OR_RAISE(
+      auto fragment_scan_options,
+      GetFragmentScanOptions<CsvFragmentScanOptions>(
+          kCsvTypeName, scan_options.get(), format.default_fragment_scan_options));
   ARROW_ASSIGN_OR_RAISE(auto reader_options, GetReadOptions(format, scan_options));
-
   ARROW_ASSIGN_OR_RAISE(auto input, source.OpenCompressed());
+  if (reader_options.encoding != "UTF-8") {
+    if (fragment_scan_options->stream_transform_func) {
+      ARROW_ASSIGN_OR_RAISE(input, fragment_scan_options->stream_transform_func(input));
+    } else {
+      return Status::Invalid("File encoding is not UTF-8, but no stream_transform_func has been provided.");

Review Comment:
   Invalid is fine (IOError doesn't quite fit IMO)
   
   ```suggestion
         return Status::Invalid("File encoding is not UTF-8, but no stream_transform_func has been provided");
   ```



##########
cpp/src/arrow/csv/options.h:
##########
@@ -163,6 +163,9 @@ struct ARROW_EXPORT ReadOptions {
   /// If false, column names will be read from the first CSV row after `skip_rows`.
   bool autogenerate_column_names = false;
 
+  /// Character encoding used
+  std::string encoding = "UTF-8";

Review Comment:
   nit: maybe a formal constant for the encoding? 



##########
cpp/src/arrow/dataset/file_csv.cc:
##########
@@ -183,9 +183,19 @@ static inline Future<std::shared_ptr<csv::StreamingReader>> OpenReaderAsync(
   auto tracer = arrow::internal::tracing::GetTracer();
   auto span = tracer->StartSpan("arrow::dataset::CsvFileFormat::OpenReaderAsync");
 #endif
+  ARROW_ASSIGN_OR_RAISE(
+      auto fragment_scan_options,
+      GetFragmentScanOptions<CsvFragmentScanOptions>(
+          kCsvTypeName, scan_options.get(), format.default_fragment_scan_options));
   ARROW_ASSIGN_OR_RAISE(auto reader_options, GetReadOptions(format, scan_options));
-
   ARROW_ASSIGN_OR_RAISE(auto input, source.OpenCompressed());
+  if (reader_options.encoding != "UTF-8") {

Review Comment:
   Somewhere in the CSV reader itself we should also validate the option



##########
python/pyarrow/dataset.py:
##########
@@ -433,6 +433,10 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
+    if isinstance(format, CsvFileFormat):

Review Comment:
   hmm, it feels like this needs to be placed "lower" (unless there aren't any other ways of building a filesystem dataset?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1206764774

   But it's not a big deal, I think so long as the field is clearly documented


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1209068949

   Thanks for checking in, @pitrou ! Most important is to choose which approach to take, to make that easier I opened https://github.com/apache/arrow/pull/13820 to compare against.
   After that I need to check why some of the tests are failing (they seem unrelated) and maybe polish a bit and then I'll move it out of the draft state.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1198568725

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #13709: Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r929931695


##########
cpp/src/arrow/python/io.h:
##########
@@ -112,5 +112,8 @@ std::shared_ptr<::arrow::io::InputStream> MakeTransformInputStream(
     std::shared_ptr<::arrow::io::InputStream> wrapped, TransformInputStreamVTable vtable,
     PyObject* arg);
 
+std::function<Result<std::shared_ptr<io::InputStream>>(std::shared_ptr<io::InputStream>)>
+makeStreamTransformFunc(TransformInputStreamVTable vtable, PyObject* handler);

Review Comment:
   nit: UpperCamelCase?



##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):
+    """
+    Construct an Arrow buffer from a Python bytes-like or buffer-like object
+
+    Parameters
+    ----------
+    obj : object
+        the object from which the buffer should be constructed.
+    """
+    cdef shared_ptr[CBuffer] buf
+    buf = GetResultValue(PyBuffer.FromPyObject(obj))
+    return pyarrow_wrap_buffer(buf)
+
+
+# From io.pxi
+cdef void _cb_transform(transform_func, const shared_ptr[CBuffer]& src,

Review Comment:
   I _think_ you could make this `cimport`able if placed into `lib.pxd` but it's probably not a big deal to just inline it here



##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):

Review Comment:
   This should be importable as just `from pyarrow.lib import py_buffer` (you can see _plasma.pyx and _cuda.pyx do the same)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1203123960

   That sums it up very nicely. Both alternatives are fine with me. I just pushed an update that aims to do what you suggest, which is adding an `encoding` field to the C++ struct. The CSV reader returns an `Invalid` error when the user has specified an encoding other than UTF-8 but the `stream_transform_func` is empty. 
   Is that the right error type or would an `IOError` be more suitable?
   How do you feel about the name of the `set_transcoder` function in `CsvFragmentScanOptions`? Should I add a docstring to it? Should the `stream_transform_func` be added to the equals() function? (in that case I think I need to add a getter/setting for it too)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1203247943

   > How do you feel about the name of the set_transcoder function in CsvFragmentScanOptions? Should I add a docstring to it?
   
   It feels like it shouldn't be publicly accessible? Or else it should mirror the C++ side option name 1:1
   
   > Should the stream_transform_func be added to the equals() function? (in that case I think I need to add a getter/setting for it too)
   
   I guess we can only get pointer equality, but yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r939931651


##########
python/pyarrow/dataset.py:
##########
@@ -433,6 +433,10 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
+    if isinstance(format, CsvFileFormat):

Review Comment:
   I can move this down to the place where the `CsvFragmentScanOptions` is created, and just set the wrapper function in the init function. That also takes care of there being a publicly accessible function which we don't want.



##########
python/pyarrow/io.pxi:
##########
@@ -1547,6 +1547,33 @@ class Transcoder:
         return self._encoder.encode(self._decoder.decode(buf, final), final)
 
 
+cdef shared_ptr[function[StreamWrapFunc]] make_streamwrap_func(
+        src_encoding, dest_encoding) except *:
+    """
+    Create a function that will add a transcoding transformation to a stream.
+    Data from that stream will be decoded according to ``src_encoding`` and
+    then re-encoded according to ``dest_encoding``.
+    The created function can be used to wrap streams.
+
+    Parameters
+    ----------
+    src_encoding : str
+        The codec to use when reading data.
+    dest_encoding : str
+        The codec to use for emitted data.
+    """
+    cdef:
+        shared_ptr[function[StreamWrapFunc]] empty_func
+        CTransformInputStreamVTable vtable
+
+    vtable.transform = _cb_transform
+    src_codec = codecs.lookup(src_encoding)
+    dest_codec = codecs.lookup(dest_encoding)

Review Comment:
   Yes, also because returning an empty function pointer from here does not seem to work!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1208288007

   @joosthooz Do you want reviewing at this point or are you looking to polish this PR first?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz closed pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz closed pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner
URL: https://github.com/apache/arrow/pull/13709


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r940992786


##########
python/pyarrow/dataset.py:
##########
@@ -433,6 +433,10 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
+    if isinstance(format, CsvFileFormat):

Review Comment:
   Done in [9a5bf20](https://github.com/apache/arrow/pull/13709/commits/9a5bf209a859fd05d5d219421a7ec6cbc205f28f)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r932617487


##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):
+    """
+    Construct an Arrow buffer from a Python bytes-like or buffer-like object
+
+    Parameters
+    ----------
+    obj : object
+        the object from which the buffer should be constructed.
+    """
+    cdef shared_ptr[CBuffer] buf
+    buf = GetResultValue(PyBuffer.FromPyObject(obj))
+    return pyarrow_wrap_buffer(buf)
+
+
+# From io.pxi
+cdef void _cb_transform(transform_func, const shared_ptr[CBuffer]& src,

Review Comment:
   Duplicate, was removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #13709: Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

westonpace commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r931516618


##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):
+    """
+    Construct an Arrow buffer from a Python bytes-like or buffer-like object
+
+    Parameters
+    ----------
+    obj : object
+        the object from which the buffer should be constructed.
+    """
+    cdef shared_ptr[CBuffer] buf
+    buf = GetResultValue(PyBuffer.FromPyObject(obj))
+    return pyarrow_wrap_buffer(buf)
+
+
+# From io.pxi
+cdef void _cb_transform(transform_func, const shared_ptr[CBuffer]& src,
+                        shared_ptr[CBuffer]* dest) except *:
+    py_dest = transform_func(pyarrow_wrap_buffer(src))
+    dest[0] = pyarrow_unwrap_buffer(py_buffer(py_dest))
+
+
+# from io.pxi
+class Transcoder:

Review Comment:
   Hmm, do we really want to jump back to python here instead of using C++ utilities for decoding? (I'm not sure if there are any good standard utilities so maybe the answer is yes).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r932621048


##########
python/pyarrow/_dataset.pyx:
##########
@@ -1237,6 +1238,40 @@ cdef class CsvFileFormat(FileFormat):
         return f"<CsvFileFormat parse_options={self.parse_options}>"
 
 
+# From io.pxi
+def py_buffer(object obj):
+    """
+    Construct an Arrow buffer from a Python bytes-like or buffer-like object
+
+    Parameters
+    ----------
+    obj : object
+        the object from which the buffer should be constructed.
+    """
+    cdef shared_ptr[CBuffer] buf
+    buf = GetResultValue(PyBuffer.FromPyObject(obj))
+    return pyarrow_wrap_buffer(buf)
+
+
+# From io.pxi
+cdef void _cb_transform(transform_func, const shared_ptr[CBuffer]& src,
+                        shared_ptr[CBuffer]* dest) except *:
+    py_dest = transform_func(pyarrow_wrap_buffer(src))
+    dest[0] = pyarrow_unwrap_buffer(py_buffer(py_dest))
+
+
+# from io.pxi
+class Transcoder:

Review Comment:
   I hope to add that as a possibility in the future, but for now I wanted to mimic the behavior of `read_csv` as much as possible. We'll have to see how bad of a bottleneck this will create. But for scanning a single file it shouldn't matter, and that is good enough for my use case because I just want to be able to deal with files that are larger than memory (which pyarrow.dataset will allow me to do and read_csv will not)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

westonpace commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r932704893


##########
python/pyarrow/io.pxi:
##########
@@ -1547,6 +1547,33 @@ class Transcoder:
         return self._encoder.encode(self._decoder.decode(buf, final), final)
 
 
+cdef shared_ptr[function[StreamWrapFunc]] make_streamwrap_func(
+        src_encoding, dest_encoding) except *:
+    """
+    Create a function that will add a transcoding transformation to a stream.
+    Data from that stream will be decoded according to ``src_encoding`` and
+    then re-encoded according to ``dest_encoding``.
+    The created function can be used to wrap streams once they are created.
+
+    Parameters
+    ----------
+    src_encoding : str
+        The codec to use when reading data data.

Review Comment:
   ```suggestion
           The codec to use when reading data.
   ```



##########
python/pyarrow/io.pxi:
##########
@@ -1547,6 +1547,33 @@ class Transcoder:
         return self._encoder.encode(self._decoder.decode(buf, final), final)
 
 
+cdef shared_ptr[function[StreamWrapFunc]] make_streamwrap_func(
+        src_encoding, dest_encoding) except *:
+    """
+    Create a function that will add a transcoding transformation to a stream.
+    Data from that stream will be decoded according to ``src_encoding`` and
+    then re-encoded according to ``dest_encoding``.
+    The created function can be used to wrap streams once they are created.

Review Comment:
   ```suggestion
       The created function can be used to wrap streams.
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1201290515

   I think we're getting a bit far afield…Dynamic linking needs platform-specific code and usually we configure optional dependencies with build flags.
   
   What if we add the C++-side field, have it error in C++ if not set to the default, and in python, we can reset the value to the default and configure the transcoder? That leaves us the path to upgrade and should avoid excessive python-side hacks. If we decide it's valuable to have built-in C++ side transcoding, then we have the option there already. 
   
   An alternative would be to have the Python wrappers for these structs no longer actually wrap the C++ structs, so that we aren't limited to the C++ fields. But that would lead to some code duplication/messiness as well.
   
   I'm not sure we can avoid some messiness: the fundamental issue is that we have a Python-only field but are trying to directly wrap the C++ structs. That extra field needs to be mirrored _somewhere_. Either we do work to pass it around on the Python side or we give in and add it in C++.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r939132124


##########
python/pyarrow/io.pxi:
##########
@@ -1547,6 +1547,33 @@ class Transcoder:
         return self._encoder.encode(self._decoder.decode(buf, final), final)
 
 
+cdef shared_ptr[function[StreamWrapFunc]] make_streamwrap_func(
+        src_encoding, dest_encoding) except *:
+    """
+    Create a function that will add a transcoding transformation to a stream.
+    Data from that stream will be decoded according to ``src_encoding`` and
+    then re-encoded according to ``dest_encoding``.
+    The created function can be used to wrap streams.
+
+    Parameters
+    ----------
+    src_encoding : str
+        The codec to use when reading data.
+    dest_encoding : str
+        The codec to use for emitted data.
+    """
+    cdef:
+        shared_ptr[function[StreamWrapFunc]] empty_func
+        CTransformInputStreamVTable vtable
+
+    vtable.transform = _cb_transform
+    src_codec = codecs.lookup(src_encoding)
+    dest_codec = codecs.lookup(dest_encoding)

Review Comment:
   Ah, the skip happens higher up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r936757983


##########
cpp/src/arrow/csv/options.h:
##########
@@ -163,6 +163,9 @@ struct ARROW_EXPORT ReadOptions {
   /// If false, column names will be read from the first CSV row after `skip_rows`.
   bool autogenerate_column_names = false;
 
+  /// Character encoding used
+  std::string encoding = "UTF-8";

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1206764452

   Ah, thanks for explaining.
   
   Wonder if we should/could pass a copy of the ReadOptions then?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

westonpace commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1199853722

   > I store the value in CsvFragmentScanOptions, and then restore it.
   > How do you feel about this?
   
   I like this approach if you can get it working.  Can you add this to the `CsvFileFormat` constructor?
   
   ```
           else :
               # default_fragment_scan_options is needed to add a transcoder
               self.default_fragment_scan_options = CsvFragmentScanOptions()
           if read_options is not None:
               self.default_fragment_scan_options.encoding = read_options.encoding
   ```
   
   >  Is it an option to add encoding as a field to the C struct ReadOptions?
   
   That seems undesirable.  The C++ csv reader doesn't have the field because it has no ability to handle encodings.  So I'm not sure we want to add a field that is completely ignored.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1201019805

   I got it working but it isn't pretty. I had to duplicate the python-specific `encoding` field to both `CsvFileFormat` and `CsvFragmentScanOptions`, and modify the getter function `default_fragment_scan_options` in `FileFormat`!
   From https://github.com/apache/arrow/pull/13709/commits/4d9802b60480180e8deab6dd162a72d837cccdeb:
   ```It needs to be stored in both CsvFileFormat and CsvFragmentScanOptions because if the user has a reference to these separate objects, they would otherwise become inconsistent.
   1 would report the default 'utf8' (forgetting the user's encoding choice), while the other would still properly report the requested encoding.
   To the user it would be unclear which of these values would be eventually used by the transcoding.
   ```
   This also seems pretty undesirable. 
   What about if we added an option to allow the user to specify a library to do the transcoding in C++, that we then dynamically link to? Then we can just add the field to the C struct and the Python transcoding in this PR would also still work nicely. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1209452566

   Continuing here: https://github.com/apache/arrow/pull/13820


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] joosthooz commented on a diff in pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

Posted by GitBox <gi...@apache.org>.

joosthooz commented on code in PR #13709:
URL: https://github.com/apache/arrow/pull/13709#discussion_r939932437


##########
cpp/src/arrow/dataset/file_csv.cc:
##########
@@ -183,9 +183,19 @@ static inline Future<std::shared_ptr<csv::StreamingReader>> OpenReaderAsync(
   auto tracer = arrow::internal::tracing::GetTracer();
   auto span = tracer->StartSpan("arrow::dataset::CsvFileFormat::OpenReaderAsync");
 #endif
+  ARROW_ASSIGN_OR_RAISE(
+      auto fragment_scan_options,
+      GetFragmentScanOptions<CsvFragmentScanOptions>(
+          kCsvTypeName, scan_options.get(), format.default_fragment_scan_options));
   ARROW_ASSIGN_OR_RAISE(auto reader_options, GetReadOptions(format, scan_options));
-
   ARROW_ASSIGN_OR_RAISE(auto input, source.OpenCompressed());
+  if (reader_options.encoding != "UTF-8") {

Review Comment:
   I made a top-level comment about this here https://github.com/apache/arrow/pull/13709#issuecomment-1205010971



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org