Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/16 17:35:06 UTC

[GitHub] [arrow] pitrou opened a new pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

pitrou opened a new pull request #7456:
URL: https://github.com/apache/arrow/pull/7456


   Also add a C++ InputStream wrapper that transforms data from the stream according to an arbitrary callable.
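   For readers unfamiliar with the idea, the sketch below shows a stream wrapper that passes every chunk it reads through an arbitrary callable. It is a rough pure-Python illustration built on the standard library, not the C++ wrapper added by this pull request. The names `TransformingReader` and `make_transcoder` are hypothetical and exist only for this example.

```python
import codecs
import io


class TransformingReader:
    """Wrap a binary stream and pass each chunk read through a callable.

    Hypothetical illustration only; not Arrow's implementation.
    """

    def __init__(self, raw, transform):
        self._raw = raw
        self._transform = transform  # callable: (bytes, final: bool) -> bytes
        self._pending = b""
        self._eof = False

    def read(self, nbytes):
        # Pull from the wrapped stream until enough transformed bytes exist
        # or the wrapped stream is exhausted.
        while len(self._pending) < nbytes and not self._eof:
            chunk = self._raw.read(max(nbytes, 1))
            self._eof = not chunk
            self._pending += self._transform(chunk, self._eof)
        out, self._pending = self._pending[:nbytes], self._pending[nbytes:]
        return out


def make_transcoder(src_encoding, dest_encoding):
    """Build a stateful callable that re-encodes bytes incrementally."""
    decoder = codecs.getincrementaldecoder(src_encoding)()
    encoder = codecs.getincrementalencoder(dest_encoding)()

    def transcode(chunk, final):
        # The incremental decoder buffers a character split across chunks.
        return encoder.encode(decoder.decode(chunk, final), final)

    return transcode


text = "Noël, zéphyr, bœuf"
reader = TransformingReader(io.BytesIO(text.encode("utf-16")),
                            make_transcoder("utf-16", "utf-8"))
pieces = []
while True:
    piece = reader.read(3)  # deliberately small, odd-sized reads
    if not piece:
        break
    pieces.append(piece)
result = b"".join(pieces).decode("utf-8")
```

   Keeping the callable stateful lets the wrapper stay generic: transcoding is just one possible transform, and the reader itself knows nothing about encodings.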


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-645282287


   Should the user guide also be updated? (See the changes to `docs/source/cpp/csv.rst` and `docs/source/python/csv.rst` from https://github.com/apache/arrow/commit/1325d139b7301df40ec0b22a74afd675aebe0d94)





[GitHub] [arrow] wesm closed pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7456:
URL: https://github.com/apache/arrow/pull/7456


   





[GitHub] [arrow] pitrou commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-646550402


   I'll add a test that encoding errors are propagated correctly (hopefully :-)).





[GitHub] [arrow] pitrou commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-645283324


   Ah, right, thank you.





[GitHub] [arrow] github-actions[bot] commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-644915754


   https://issues.apache.org/jira/browse/ARROW-9106





[GitHub] [arrow] wesm commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-649580251


   Looks good here. Merging





[GitHub] [arrow] pitrou commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-649522260


   Ok, I added some tests for error propagation. I'm going to merge if CI stays green.





[GitHub] [arrow] pitrou commented on a change in pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#discussion_r442545757



##########
File path: python/pyarrow/tests/test_io.py
##########
@@ -1289,6 +1290,56 @@ def test_compressed_recordbatch_stream(compression):
     assert got_table == table
 
 
+# ----------------------------------------------------------------------
+# Transform input streams
+
+unicode_transcoding_example = (
+    "Dès Noël où un zéphyr haï me vêt de glaçons würmiens "
+    "je dîne d’exquis rôtis de bœuf au kir à l’aÿ d’âge mûr & cætera !"
+)
+
+
+def check_transcoding(data, src_encoding, dest_encoding, chunk_sizes):
+    chunk_sizes = iter(chunk_sizes)
+    stream = pa.transcoding_input_stream(
+        pa.BufferReader(data.encode(src_encoding)),
+        src_encoding, dest_encoding)
+    out = []
+    while True:
+        buf = stream.read(next(chunk_sizes))
+        out.append(buf)
+        if not buf:
+            break
+    out = b''.join(out)
+    assert out.decode(dest_encoding) == data
+
+
+@pytest.mark.parametrize('src_encoding, dest_encoding',
+                         [('utf-8', 'utf-16'),
+                          ('utf-16', 'utf-8'),
+                          ('utf-8', 'utf-32-le'),
+                          ('utf-8', 'utf-32-be'),
+                          ])
+def test_transcoding_input_stream(src_encoding, dest_encoding):
+    # All at once
+    check_transcoding(unicode_transcoding_example,
+                      src_encoding, dest_encoding, [1000, 0])
+    # Incremental
+    check_transcoding(unicode_transcoding_example,

Review comment:
       TODO: should perhaps exercise encoding errors
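
       As a rough idea of the error cases such a test might cover, the sketch below uses the standard library's incremental codecs as a stand-in for the transcoding stream. It does not call pyarrow, and the helper names `transcode_chunks` and `raises_unicode_error` are hypothetical. Whether the real stream reports these cases the same way is an assumption to verify.

```python
import codecs


def transcode_chunks(chunks, src_encoding, dest_encoding):
    """Re-encode an iterable of byte chunks, raising on invalid input."""
    decoder = codecs.getincrementaldecoder(src_encoding)()
    encoder = codecs.getincrementalencoder(dest_encoding)()
    out = [encoder.encode(decoder.decode(chunk)) for chunk in chunks]
    # final=True makes the decoder reject a sequence truncated at end of input.
    out.append(encoder.encode(decoder.decode(b"", final=True), final=True))
    return b"".join(out)


def raises_unicode_error(chunks, src_encoding, dest_encoding):
    """Return True if transcoding the chunks raises a UnicodeError."""
    try:
        transcode_chunks(chunks, src_encoding, dest_encoding)
    except UnicodeError:
        return True
    return False


# 0xff can never appear in valid UTF-8.
invalid_byte = raises_unicode_error([b"ab", b"\xff", b"cd"], "utf-8", "utf-16")
# "é" is b"\xc3\xa9" in UTF-8; dropping the second byte truncates the sequence.
truncated = raises_unicode_error([b"caf", b"\xc3"], "utf-8", "utf-16")
# A character that the destination encoding cannot represent.
unencodable = raises_unicode_error(["œ".encode("utf-8")], "utf-8", "latin-1")
# Splitting a valid sequence across chunks is not an error.
split_ok = not raises_unicode_error([b"caf\xc3", b"\xa9"], "utf-8", "utf-16")
```

       The three failing cases are distinct: an invalid byte fails as soon as it is decoded, a truncated sequence only fails once the end of input is signalled, and an unencodable character fails on the encoding side.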



