Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/14 13:22:45 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

jorisvandenbossche opened a new pull request #8912:
URL: https://github.com/apache/arrow/pull/8912


   The C++ `FileSystemDatasetFactory::Finish` method handles schema inference or validation with two options: `InspectOptions::fragments` indicates the number of fragments to use when inferring *or* validating the schema (default 1), and `FinishOptions::validate_fragments` indicates whether to validate a specified (i.e. not inferred) schema.
   
   For now, I decided to combine these into a single keyword on the Python side (`validate_schema`). This avoids adding two inter-dependent keywords, and makes it easier to express some typical use cases (e.g. validating the specified schema against all fragments is now `validate_schema=True` instead of `validate_fragments=True, fragments=-1`). On the other hand, it gives a single keyword that accepts either a boolean or an int (which is not super clean). So this is certainly up for discussion.
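   As a minimal, hypothetical sketch (plain Python, not the actual pyarrow implementation) of how such a combined `validate_schema` keyword could be normalized back into the two underlying options described above:

   ```python
   def normalize_validate_schema(validate_schema):
       """Map the combined keyword onto (fragments, validate_fragments).

       Hypothetical helper illustrating the mapping described above:
       - False/None -> inspect 1 fragment (the default), no validation
       - True       -> inspect all fragments (-1) and validate
       - int n      -> inspect n fragments and validate
       """
       # bools must be handled before ints, since bool is a subclass of int
       if validate_schema is True:
           return -1, True
       if validate_schema is False or validate_schema is None:
           return 1, False
       if isinstance(validate_schema, int):
           return validate_schema, True
       raise TypeError(
           "validate_schema must be a bool or an int, got "
           f"{type(validate_schema).__name__}")
   ```

   The bool-before-int ordering matters because `isinstance(True, int)` is true in Python; checking `is True` / `is False` first keeps the two meanings distinct.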
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-975663089


   @jorisvandenbossche Are you planning to push this forward?





[GitHub] [arrow] lidavidm commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-818051532


   Is this still targeted for 4.0? It needs a rebase if so, but otherwise looks good.





[GitHub] [arrow] jorisvandenbossche commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-880815839


   Yeah, sorry for the slow follow-up here. It was on my to-do list to look at today.
   
   > Naming it unify_schemas or infer_schemas or similar makes sense to me. I see only three main cases:
   > ...
   > - Schema is inferred (nothing to validate against!)
   
   But for this last case, you might still have the option of inferring from the first fragment only, or of reading the schemas of all fragments and unifying them (or erroring when they can't be unified).
   
   So if we have e.g. an `infer_schemas` keyword, I suppose we want to keep the argument types I now use for `validate_schema` (True meaning infer the schema from all fragments, and an integer meaning the number of fragments to check).
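   To illustrate the difference between the two inference modes, a small stdlib-only sketch (not pyarrow code; the dict-based "schema" representation and both function names are purely illustrative) of inferring from the first fragment versus unifying all fragments and erroring on conflicts:

   ```python
   def unify_schemas(schemas):
       """Merge a list of {field: type_name} dicts, erroring on type conflicts."""
       merged = {}
       for schema in schemas:
           for field, typ in schema.items():
               if field in merged and merged[field] != typ:
                   raise ValueError(
                       f"Unable to merge: field {field!r} has incompatible "
                       f"types: {merged[field]} vs {typ}")
               merged.setdefault(field, typ)
       return merged


   def infer_schema(fragment_schemas, fragments=1):
       """Inspect the first `fragments` schemas (-1 meaning all) and unify them."""
       subset = fragment_schemas if fragments == -1 else fragment_schemas[:fragments]
       return unify_schemas(subset)
   ```

   With `fragments=1` a type mismatch in a later fragment goes unnoticed at inference time (it would only surface when scanning), while `fragments=-1` surfaces it immediately.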





[GitHub] [arrow] lidavidm commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-880801883


   @jorisvandenbossche just a gentle nudge :) 





[GitHub] [arrow] jorisvandenbossche commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-820499099


   Rebased now.
   
   @lidavidm I am still a bit in doubt about the exact API. 
   For example, R has a `unify_schemas` keyword in its `open_dataset` function (https://arrow.apache.org/docs/r/reference/open_dataset.html). I find that name a bit more descriptive for the use case of "infer the schema by checking all fragments", since in practice it then unifies the schemas of all fragments (while with the current PR, you get that by using `validate_schema=True`).
   On the other hand, R doesn't (currently) give a way to validate a specified schema, but I'm not sure how useful that is in practice (since you would get the same error when starting to scan, if the specified schema doesn't match).
    
   





[GitHub] [arrow] lidavidm commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-820702988


   Naming it `unify_schemas` or `infer_schemas` or similar makes sense to me. I see only three main cases:
   - Schema is both given and inferred (and validation should always take place, else why do both?)
   - Schema is given (nothing to validate against, at least until scan time)
   - Schema is inferred (nothing to validate against!)
   So to me it makes more sense to have it be about inference than about validation.
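   A hypothetical sketch of this three-way decision (illustrative only, not the actual factory code; the function and return labels are made up):

   ```python
   def finish_strategy(schema_given: bool, infer: bool) -> str:
       """Which of the three cases applies when finishing a dataset factory."""
       if schema_given and infer:
           # Both given and inferred: the inferred schema is checked
           # against the given one, i.e. validation always takes place.
           return "validate"
       if schema_given:
           # Given only: nothing to validate against until scan time.
           return "trust"
       # Inferred only: nothing to validate against at all.
       return "infer"
   ```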





[GitHub] [arrow] pitrou commented on a change in pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#discussion_r542454058



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2240,6 +2240,89 @@ def test_dataset_project_null_column(tempdir):
     assert dataset.to_table().equals(expected)
 
 
+@pytest.mark.parquet
+def test_dataset_validate_schema_keyword(tempdir):
+    # ARROW-8221
+    import pyarrow.parquet as pq
+
+    basedir = tempdir / "dataset_mismatched_schemas"
+    basedir.mkdir()
+
+    table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
+    pq.write_table(table1, basedir / "data1.parquet")
+    table2 = pa.table({'a': ["a", "b", "c"], 'b': [1, 2, 3]})
+    pq.write_table(table2, basedir / "data2.parquet")
+
+    msg_scanning = "matching names but differing types"
+    msg_inspecting = "Unable to merge: Field a has incompatible types"
+
+    # default (inspecting first fragments) works, but fails scanning
+    dataset = ds.dataset(basedir)
+    assert dataset.schema.equals(table1.schema)

Review comment:
       Do datasets guarantee that the first file in alphabetical order is used to infer the schema?







[GitHub] [arrow] pitrou commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-867006635


   @jorisvandenbossche What is the status on this?





[GitHub] [arrow] lidavidm commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-880819457


   That sounds reasonable to me. 





[GitHub] [arrow] github-actions[bot] commented on pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#issuecomment-744455987


   https://issues.apache.org/jira/browse/ARROW-8221





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8912: ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8912:
URL: https://github.com/apache/arrow/pull/8912#discussion_r542500407



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2240,6 +2240,89 @@ def test_dataset_project_null_column(tempdir):
     assert dataset.to_table().equals(expected)
 
 
+@pytest.mark.parquet
+def test_dataset_validate_schema_keyword(tempdir):
+    # ARROW-8221
+    import pyarrow.parquet as pq
+
+    basedir = tempdir / "dataset_mismatched_schemas"
+    basedir.mkdir()
+
+    table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
+    pq.write_table(table1, basedir / "data1.parquet")
+    table2 = pa.table({'a': ["a", "b", "c"], 'b': [1, 2, 3]})
+    pq.write_table(table2, basedir / "data2.parquet")
+
+    msg_scanning = "matching names but differing types"
+    msg_inspecting = "Unable to merge: Field a has incompatible types"
+
+    # default (inspecting first fragments) works, but fails scanning
+    dataset = ds.dataset(basedir)
+    assert dataset.schema.equals(table1.schema)

Review comment:
       Yes, the file paths get sorted:
   
   https://github.com/apache/arrow/blob/48fee6672bd8f740cfde9efdec0004641bf462c2/cpp/src/arrow/dataset/discovery.cc#L205
   
   (now, whether this should maybe rather be a "natural" sort is another issue ..)
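   To see why a lexicographic versus "natural" sort matters here, a quick stdlib-only illustration (the filenames are made up):

   ```python
   import re

   paths = ["data10.parquet", "data2.parquet", "data1.parquet"]

   # Plain lexicographic sort: "data10" sorts before "data2",
   # so the file used for default schema inference may be surprising.
   lexicographic = sorted(paths)

   def natural_key(path):
       """Sort key that compares embedded digit runs numerically."""
       return [int(part) if part.isdigit() else part
               for part in re.split(r"(\d+)", path)]

   natural = sorted(paths, key=natural_key)
   # lexicographic: data1, data10, data2
   # natural:       data1, data2, data10
   ```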



