Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/14 16:01:09 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

jorisvandenbossche commented on a change in pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#discussion_r542500407



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2240,6 +2240,89 @@ def test_dataset_project_null_column(tempdir):
     assert dataset.to_table().equals(expected)
 
 
+@pytest.mark.parquet
+def test_dataset_validate_schema_keyword(tempdir):
+    # ARROW-8221
+    import pyarrow.parquet as pq
+
+    basedir = tempdir / "dataset_mismatched_schemas"
+    basedir.mkdir()
+
+    table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
+    pq.write_table(table1, basedir / "data1.parquet")
+    table2 = pa.table({'a': ["a", "b", "c"], 'b': [1, 2, 3]})
+    pq.write_table(table2, basedir / "data2.parquet")
+
+    msg_scanning = "matching names but differing types"
+    msg_inspecting = "Unable to merge: Field a has incompatible types"
+
+    # default (inspecting first fragments) works, but fails scanning
+    dataset = ds.dataset(basedir)
+    assert dataset.schema.equals(table1.schema)

Review comment:
       Yes, the file paths get sorted:
   
   https://github.com/apache/arrow/blob/48fee6672bd8f740cfde9efdec0004641bf462c2/cpp/src/arrow/dataset/discovery.cc#L205
   
   (now, whether this should perhaps be a "natural" sort instead is another issue ...)
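   
   As a rough illustration of that last point (a minimal sketch, not pyarrow code; the `natural_key` helper here is hypothetical), a plain lexicographic sort compares paths character by character, so a numeric run like "10" sorts before "2", whereas a "natural" sort compares the digit runs as numbers:
   
       import re
       
       paths = ["data10.parquet", "data2.parquet", "data1.parquet"]
       
       # Plain lexicographic sort: '1' < '2' as characters, so
       # "data10.parquet" sorts before "data2.parquet".
       print(sorted(paths))
       # ['data1.parquet', 'data10.parquet', 'data2.parquet']
       
       def natural_key(path):
           # Split into text and digit runs; digit runs compare as integers.
           return [int(tok) if tok.isdigit() else tok
                   for tok in re.split(r"(\d+)", path)]
       
       # A "natural" sort keeps the numeric order instead.
       print(sorted(paths, key=natural_key))
       # ['data1.parquet', 'data2.parquet', 'data10.parquet']
   
   For the two files in this test ("data1.parquet" and "data2.parquet") both orderings agree, so the schema would be inferred from data1.parquet either way.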




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org