You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/17 19:13:53 UTC

[GitHub] [arrow] fsaintjacques opened a new pull request #7474: ARROW-8802: [C++][Dataset] Preserve Dataset's schema metadata on column projection

fsaintjacques opened a new pull request #7474:
URL: https://github.com/apache/arrow/pull/7474


   Scanner does not preserve the original schema metadata when columns are projected.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques closed pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

Posted by GitBox <gi...@apache.org>.
fsaintjacques closed pull request #7474:
URL: https://github.com/apache/arrow/pull/7474


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#issuecomment-645904783


   So this was actually a "time bomb" in the extension type testing code: we were using a test extension type with the same name than an extension type that in the meantime has been implemented in pandas. So depending on whether pandas already registered this type with pyarrow or not, would trigger this error "A type extension with name pandas.period already defined". 
   
   And using the pandas parquet functionality, triggers pandas to register this extension type. So until now, when running the tests in normal order, no test has been using pandas' parquet functions, before testing `test_extension_type.py`. But with this PR, a test using pandas' `to_parquet` was added to `test_dataset.py`, which is run before `test_extension_type.py` .. and which thus triggered the failures.
   
   If I explicitly tell pytest to run eg ``test_parquetpy`` first and then `test_extension_type.py`, I get the exact same failure on master.
   
   TLDR: pushed a fix to rename our test extension type class ;)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#discussion_r442102463



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1566,3 +1566,21 @@ def test_parquet_dataset_factory_partitioned(tempdir):
     result = result.to_pandas().sort_values("f1").reset_index(drop=True)
     expected = table.to_pandas().drop(columns=["part"])
     pd.testing.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_dataset_schema_metadata(tempdir):
+    # ARROW-8802
+    df = pd.DataFrame({'a': [1, 2, 3]})
+    path = tempdir / "test.parquet"
+    df.to_parquet(path)
+    dataset = ds.dataset(path)
+
+    schema = dataset.to_table().schema
+    projected_schema = dataset.to_table(columns=["a"]).schema
+
+    # ensure the pandas metadata is included in the schema
+    assert b"pandas" in schema.metadata

Review comment:
       added an additional assert to ensure the "pandas" key is actually present, because if for some reason we accidentally remove it in both cases, this test won't detect that as both schema's are still identical (both missing the metadata). 
   (although I assume we have other tests that would start failing then)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#issuecomment-645570471


   https://issues.apache.org/jira/browse/ARROW-8802


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#issuecomment-645904783


   So this was actually a "time bomb" in the extension type testing code: we were using a test extension type with the same name than an extension type that in the meantime has been implemented in pandas. So depending on whether pandas already registered this type with pyarrow or not, would trigger this error "A type extension with name pandas.period already defined". 
   
   And using the pandas parquet functionality, triggers pandas to register this extension type. So until now, when running the tests in normal order, no test has been using pandas' parquet functions, before testing `test_extension_type.py`. But with this PR, a test using pandas' `to_parquet` was dded to `test_dataset.py`, which is run before `test_extension_type.py` ..
   
   If I explicitly tell pytest to run eg ``test_parquetpy`` first and then `test_extension_type.py`, I get the exact same failure on master.
   
   TLDR: pushed a fix to rename our test extension type class ;)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#issuecomment-645595500


   @jorisvandenbossche I'm puzzled by the error. Could it be an upstream pandas change?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org