Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/18 09:43:08 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

jorisvandenbossche commented on a change in pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#discussion_r442102463



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1566,3 +1566,21 @@ def test_parquet_dataset_factory_partitioned(tempdir):
     result = result.to_pandas().sort_values("f1").reset_index(drop=True)
     expected = table.to_pandas().drop(columns=["part"])
     pd.testing.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_dataset_schema_metadata(tempdir):
+    # ARROW-8802
+    df = pd.DataFrame({'a': [1, 2, 3]})
+    path = tempdir / "test.parquet"
+    df.to_parquet(path)
+    dataset = ds.dataset(path)
+
+    schema = dataset.to_table().schema
+    projected_schema = dataset.to_table(columns=["a"]).schema
+
+    # ensure the pandas metadata is included in the schema
+    assert b"pandas" in schema.metadata

Review comment:
       Added an additional assert to ensure the "pandas" key is actually present: if for some reason we accidentally removed the metadata in both cases, this test would not detect it, because both schemas would still be identical (both missing the metadata).
   (Although I assume we have other tests that would start failing then.)
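
   To illustrate the reasoning, here is a minimal sketch that does not use pyarrow. Plain dicts with bytes keys stand in for schema metadata, and the helper names `check_equal_only` and `check_present_and_equal` are hypothetical, not part of the test suite or of any library.

```python
# Hypothetical stand-ins: plain dicts model schema metadata (bytes keys).

def check_equal_only(full_meta, projected_meta):
    """Weak check: only compares the two metadata mappings."""
    return full_meta == projected_meta


def check_present_and_equal(full_meta, projected_meta, key=b"pandas"):
    """Stronger check: the key must exist in the unprojected metadata,
    and the projected metadata must match it."""
    return key in full_meta and full_meta == projected_meta


# Placeholder value; the real pandas metadata is a JSON blob.
preserved = {b"pandas": b"<pandas metadata>"}
dropped = {}

# Metadata preserved on both paths: both checks pass.
ok_weak = check_equal_only(preserved, preserved)
ok_strong = check_present_and_equal(preserved, preserved)

# Regression where metadata is dropped on both paths:
# the weak check still passes, the stronger one fails.
regressed_weak = check_equal_only(dropped, dropped)
regressed_strong = check_present_and_equal(dropped, dropped)
```

   With the equality check alone, the "dropped on both paths" case passes silently; adding the presence check makes it fail.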



