Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/14 15:04:22 UTC

[GitHub] [arrow] pitrou commented on a change in pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

pitrou commented on a change in pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#discussion_r542454058



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2240,6 +2240,89 @@ def test_dataset_project_null_column(tempdir):
     assert dataset.to_table().equals(expected)
 
 
+@pytest.mark.parquet
+def test_dataset_validate_schema_keyword(tempdir):
+    # ARROW-8221
+    import pyarrow.parquet as pq
+
+    basedir = tempdir / "dataset_mismatched_schemas"
+    basedir.mkdir()
+
+    table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
+    pq.write_table(table1, basedir / "data1.parquet")
+    table2 = pa.table({'a': ["a", "b", "c"], 'b': [1, 2, 3]})
+    pq.write_table(table2, basedir / "data2.parquet")
+
+    msg_scanning = "matching names but differing types"
+    msg_inspecting = "Unable to merge: Field a has incompatible types"
+
+    # default (inspecting first fragments) works, but fails scanning
+    dataset = ds.dataset(basedir)
+    assert dataset.schema.equals(table1.schema)

Review comment:
       Do datasets guarantee that the first file in alphabetical order is used to infer the schema?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org