Posted to github@arrow.apache.org by "vibhatha (via GitHub)" <gi...@apache.org> on 2023/02/01 12:02:25 UTC

[GitHub] [arrow] vibhatha opened a new pull request, #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

vibhatha opened a new pull request, #33969:
URL: https://github.com/apache/arrow/pull/33969

   This PR includes a minor modification to `HivePartitioningFactory` to fix an error that occurs when reading non-partitioned data with hive partitioning. Test cases for Python and C++ have been added to verify the change.
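   
   For context, a minimal sketch of the kind of flat-layout read involved, assuming a Parquet file in an otherwise unpartitioned directory (the paths and column names here are illustrative, not taken from the PR):
   
   ```python
   import os
   import tempfile
   
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # A flat (non-partitioned) directory containing a single Parquet file.
   base = tempfile.mkdtemp()
   table = pa.table({"a": [1, 2, 1, 2, 3, 4], "b": [10, 20, 30, 40, 50, 60]})
   pq.write_table(table, os.path.join(base, "table.parquet"))
   
   # Requesting hive partitioning discovery on a flat layout should still work:
   # there are no key=value directories, so no partition fields should be
   # inferred and the data should round-trip unchanged.
   dataset = ds.dataset(base, format="parquet", partitioning="hive")
   assert dataset.to_table().equals(table)
   ```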


[GitHub] [arrow] vibhatha commented on pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on PR #33969:
URL: https://github.com/apache/arrow/pull/33969#issuecomment-1437817323

   > Marking "request changes" while we investigate whether there is a bug with specifying the schema or not.
   
   Yeah, I will investigate this further.


[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100804152


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I agree, we don't have to provide the schema. In the PR where we added this test, I included the schema; that could be my misunderstanding.
   
   But my concern is that if one does specify that schema with this new change, we get null values in the `exp_meta` column.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1098090692


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -1107,5 +1108,88 @@ TEST(TestStripPrefixAndFilename, Basic) {
                                           "year=2019/month=12/day=01"));
 }
 
+TEST_F(TestPartitioning, HivePartitionReadEvaluation) {

Review Comment:
   Should we include another test case, or is this change satisfactory?



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100926556


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   > In theory we don't need to pass the schema here, but if one does, we get invalid results
   
   In the current code?  Or in this PR?  Or both?
   
   Both passing and not passing the schema should work.  If passing the schema gives invalid results then I would ask that we follow @jorisvandenbossche's advice and create a test that isolates this behavior as much as possible.  This may require further debugging to figure out exactly what happened.
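   
   As a sketch, an isolating test along these lines might work; the test name, column names, and expected values are hypothetical, not taken from the PR:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   
   def test_hive_read_with_explicit_schema(tmp_path):
       # Illustrative sketch: write a small hive-partitioned dataset on one
       # string column, then read it back with and without an explicit schema.
       table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})
       ds.write_dataset(
           table, str(tmp_path), format="parquet",
           partitioning=ds.partitioning(pa.schema([("part", pa.string())]),
                                        flavor="hive"))
   
       inferred = ds.dataset(str(tmp_path), format="parquet",
                             partitioning="hive").to_table()
       explicit = ds.dataset(str(tmp_path), format="parquet", partitioning="hive",
                             schema=pa.schema([("value", pa.int64()),
                                               ("part", pa.string())])).to_table()
   
       # Neither read should null out the partition column, and both should agree.
       assert explicit.column("part").null_count == 0
       assert sorted(explicit.column("part").to_pylist()) == ["a", "a", "b"]
       assert sorted(inferred.column("part").to_pylist()) == ["a", "a", "b"]
   ```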



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1098048979


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   Not at all, but there was a bug when the schema was specified. 



[GitHub] [arrow] amol- closed pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema
URL: https://github.com/apache/arrow/pull/33969


[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099825533


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   Ah, it is simple. Basically, the data in the `exp_meta` column is all `null` values, which shouldn't be the case.
   What I observed was that when we create the scheme, we should pass the schema in that case. Maybe that is the wrong fix.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099805280


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   The issue is with the existing test case: when the schema is provided, that case failed, but I realized the fix was to handle it where `_ensure_partitioning` is defined.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1098090441


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -1107,5 +1108,88 @@ TEST(TestStripPrefixAndFilename, Basic) {
                                           "year=2019/month=12/day=01"));
 }
 
+TEST_F(TestPartitioning, HivePartitionReadEvaluation) {

Review Comment:
   I removed the added test case and updated it with your suggestion. I understand your concern.



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1097677559


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -827,7 +827,7 @@ class HivePartitioningFactory : public KeyValuePartitioningFactory {
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
     if (dictionaries_.empty()) {
-      return std::make_shared<HivePartitioning>(schema, dictionaries_);
+      return std::make_shared<HivePartitioning>(arrow::schema({}), dictionaries_);

Review Comment:
   Nit: Is the `arrow::` prefix necessary here?



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1925,3 +1925,20 @@ def test_write_to_dataset_kwargs_passed(tempdir, write_dataset_kwarg):
         pq.write_to_dataset(table, path, **{key: arg})
         _name, _args, kwargs = mock_write_dataset.mock_calls[0]
         assert kwargs[key] == arg
+
+
+def test_partition_schema_validity(tempdir):
+    data = {
+        "a": [1, 2, 1, 2, 3, 4],
+        "b": [10, 20, 30, 40, 50, 60]
+    }
+
+    table = pa.Table.from_pydict(data)
+    pq.write_table(table, tempdir / "table.parquet")

Review Comment:
   A full round-trip test here is ok, as it is much simpler to write such a test, and since we don't fully expose `PartitioningFactory` the narrower test wouldn't be possible anyway.



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   Does the schema even need to be specified here?



##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -1107,5 +1108,88 @@ TEST(TestStripPrefixAndFilename, Basic) {
                                           "year=2019/month=12/day=01"));
 }
 
+TEST_F(TestPartitioning, HivePartitionReadEvaluation) {

Review Comment:
   This test is too complex.  Could you create a narrower unit test that focuses on the needed behavior?  For example, there should be no need to create a filesystem and actually write files if your goal is to test the partitioning factory.  The partitioning factory takes strings & schemas as input and outputs a partitioning object.
   
   For example, you can reproduce the issue with this:
   
   ```diff
   diff --git a/cpp/src/arrow/dataset/partition_test.cc b/cpp/src/arrow/dataset/partition_test.cc
   index 3e681a9cb..0e22152e0 100644
   --- a/cpp/src/arrow/dataset/partition_test.cc
   +++ b/cpp/src/arrow/dataset/partition_test.cc
   @@ -83,8 +83,14 @@ class TestPartitioning : public ::testing::Test {
      void AssertInspect(const std::vector<std::string>& paths,
                         const std::vector<std::shared_ptr<Field>>& expected) {
        ASSERT_OK_AND_ASSIGN(auto actual, factory_->Inspect(paths));
   -    ASSERT_EQ(*actual, Schema(expected));
   -    ASSERT_OK_AND_ASSIGN(partitioning_, factory_->Finish(actual));
   +    Schema expected_schema(expected);
   +    ASSERT_EQ(*actual, expected_schema);
   +    // The dataset will usually have more than just partition fields.  The partitioning
   +    // schema should ignore these.
   +    std::vector<std::shared_ptr<Field>> all_fields(expected);
   +    all_fields.push_back(field("field_unrelated_to_partitioning", int8()));
   +    ASSERT_OK_AND_ASSIGN(partitioning_, factory_->Finish(schema(std::move(all_fields))));
   +    AssertSchemaEqual(expected_schema, *partitioning_->schema());
      }
    
      void AssertPartition(const std::shared_ptr<Partitioning> partitioning,
   @@ -571,6 +577,11 @@ TEST_F(TestPartitioning, DiscoverHiveSchema) {
      // missing path segments will not cause an error
      AssertInspect({"/alpha=0/beta=1", "/beta=2/alpha=3", "/gamma=what"},
                    {Int("alpha"), Int("beta"), Str("gamma")});
   +
   +  // No partitioning information should yield an empty partitioning even if
   +  // there are fields in the dataset schema
   +  AssertInspect({"/foo/chunk.parquet", "/bar/chunk.parquet", "/baz"}, {});
   +  AssertInspect({"chunk.parquet"}, {});
    }
    
    TEST_F(TestPartitioning, HiveDictionaryInference) {
   ```



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1098079831


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -827,7 +827,7 @@ class HivePartitioningFactory : public KeyValuePartitioningFactory {
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
     if (dictionaries_.empty()) {
-      return std::make_shared<HivePartitioning>(schema, dictionaries_);
+      return std::make_shared<HivePartitioning>(arrow::schema({}), dictionaries_);

Review Comment:
   There is an argument `schema` to the function, so it would need either renaming the argument or using the namespace. Any thoughts?



[GitHub] [arrow] vibhatha commented on pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on PR #33969:
URL: https://github.com/apache/arrow/pull/33969#issuecomment-1419357811

   cc @westonpace a few CI jobs are failing and I'm not sure if they are related. Could you please review this PR?


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099794334


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   > It works without providing the schema
   
   But a schema _already_ was specified before this PR? I don't understand what this PR is changing about this test.
   
   It might be clearer if you add a new test that fails on current master, and is fixed by this PR.



[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099137559


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I am trying to understand the diff here, but did something actually change? It seems that the schema that was specified manually before should be identical to `dt_table.schema`, so this change is just a simplification?
   But what, then, is the "bug when the schema was specified" you are mentioning?



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099798800


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I have added a new test to `test_dataset.py` to evaluate the failing case, but an existing test case fails with the current modification.
   



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100777571


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -1107,5 +1108,88 @@ TEST(TestStripPrefixAndFilename, Basic) {
                                           "year=2019/month=12/day=01"));
 }
 
+TEST_F(TestPartitioning, HivePartitionReadEvaluation) {

Review Comment:
   Not quite. But I felt the need for a roundtrip test. Other than that, I don’t have strong concerns.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099410974


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   In a previous PR I added this test case. It works without providing the schema (with the current change), and we don't need the change I made in `_ensure_partitioning`. But if a user provides a schema in this method, I get nulls for the `exp_meta` column after making this change. That was the bug I encountered. I wasn't quite confident changing the test case to make things work. Do you agree with the C++ change to fix this bug? If so, for Python we could just leave the `schema`, but in case someone provides it there, we could run into trouble.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099805646


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I could be wrong here. Any thoughts?



[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099132752


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1925,3 +1925,20 @@ def test_write_to_dataset_kwargs_passed(tempdir, write_dataset_kwarg):
         pq.write_to_dataset(table, path, **{key: arg})
         _name, _args, kwargs = mock_write_dataset.mock_calls[0]
         assert kwargs[key] == arg
+
+
+def test_partition_schema_validity(tempdir):

Review Comment:
   Can you move this test to `tests/test_dataset.py`? (This file is specifically for the ParquetDataset implementation part of `pyarrow.parquet`.)



##########
python/pyarrow/dataset.py:
##########
@@ -437,7 +437,7 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
-    partitioning = _ensure_partitioning(partitioning)
+    partitioning = _ensure_partitioning(partitioning, schema)

Review Comment:
   This doesn't seem right to me, since the user-specified `schema` is meant to be the full dataset schema, while the `partitioning` function takes the schema of only the partitioning fields (although it is then a bit surprising that no test fails).
   
   Is this change needed to fix the issue (apart from the fix in C++)?
   (At first sight, it doesn't seem needed for the Python test you added to pass.)
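   
   To illustrate the distinction with a sketch (hypothetical column names, not code from this PR): the partitioning is declared from the partition fields only, while `schema=` would describe every column of the dataset.
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   # Partitioning: built from the partition field(s) only.
   partition_only = ds.partitioning(pa.schema([("part", pa.string())]),
                                    flavor="hive")
   print(partition_only.parse("/part=a"))  # expression selecting part == "a"
   
   # Dataset schema: every column, data columns included.
   full_dataset_schema = pa.schema([("value", pa.int64()), ("part", pa.string())])
   print(full_dataset_schema)
   ```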
   



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I am trying to gauge the diff here, but did something actually change? It seems that the schema that was specified manually before should be identical to `dt_table.schema`, so this change is just a simplification?
   But what, then, is the "bug when the schema was specified" you are mentioning?



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100776857


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -827,7 +827,7 @@ class HivePartitioningFactory : public KeyValuePartitioningFactory {
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
     if (dictionaries_.empty()) {
-      return std::make_shared<HivePartitioning>(schema, dictionaries_);
+      return std::make_shared<HivePartitioning>(arrow::schema({}), dictionaries_);

Review Comment:
   Sure. I will.



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1101604902


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I am going to remove my review request until we get this sorted out, just for my own book-keeping.  Feel free to re-request as soon as this is ready for another look.



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100925260


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -1107,5 +1108,88 @@ TEST(TestStripPrefixAndFilename, Basic) {
                                           "year=2019/month=12/day=01"));
 }
 
+TEST_F(TestPartitioning, HivePartitionReadEvaluation) {

Review Comment:
   We already have quite a few round-trip tests for dataset reading and writing (Python's test_dataset has quite a few), so I'd be opposed to another one unless there is a specific reason.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100805090


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   In theory we don't need to pass the schema here, but if one does, we get invalid results. I am not quite sure which one to change :thinking: 



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100781597


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I am also confused.  Here is my understanding of this test:
   
    * A user wants to partition on column exp_meta
    * The values in exp_meta contain slashes
    * We should be able to partition on that column and still write and then read the dataset
   
   The files that get created will have paths of the form:
   
   ```
   <tmpdir>/slash-writer-x/exp_meta=experiment%2FA%2Ff.csv/chunk-0.parquet
   ```
   
   There should be no need to specify a schema manually to read this dataset.
   
   It should also be harmless to specify a schema.  However, doing so makes the test less clear.
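   
   A sketch of that round trip (the column names follow the test above, but the write/read calls here are illustrative rather than the actual test code):
   
   ```python
   import tempfile
   
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   base = tempfile.mkdtemp()
   table = pa.table({"exp_id": pa.array([1, 2], pa.int32()),
                     "exp_meta": ["experiment/A/f.csv", "experiment/B/f.csv"]})
   
   # Partition on exp_meta; the slashes in its values end up encoded in the
   # key=value directory names, as described above.
   ds.write_dataset(table, base, format="parquet",
                    partitioning=ds.partitioning(
                        pa.schema([("exp_meta", pa.string())]), flavor="hive"))
   
   # Reading back should not require a manually specified schema.
   read_back = ds.dataset(base, format="parquet", partitioning="hive").to_table()
   assert (set(read_back.column("exp_meta").to_pylist())
           == set(table.column("exp_meta").to_pylist()))
   ```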



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100775778


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -827,7 +827,7 @@ class HivePartitioningFactory : public KeyValuePartitioningFactory {
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
     if (dictionaries_.empty()) {
-      return std::make_shared<HivePartitioning>(schema, dictionaries_);
+      return std::make_shared<HivePartitioning>(arrow::schema({}), dictionaries_);

Review Comment:
   Let's disambiguate then.  Can you change the argument name to `dataset_schema`?



[GitHub] [arrow] westonpace commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100776088


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -1107,5 +1108,88 @@ TEST(TestStripPrefixAndFilename, Basic) {
                                           "year=2019/month=12/day=01"));
 }
 
+TEST_F(TestPartitioning, HivePartitionReadEvaluation) {

Review Comment:
   This is fine.  It covers your change.  Are there other branches you feel we are not testing?



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1100942119


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   Yes, I will further debug and see how to replicate this with an isolated test.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1098049346


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   To evaluate that, I left the schema in but simplified it.



[GitHub] [arrow] vibhatha commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099412661


##########
python/pyarrow/dataset.py:
##########
@@ -437,7 +437,7 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
-    partitioning = _ensure_partitioning(partitioning)
+    partitioning = _ensure_partitioning(partitioning, schema)

Review Comment:
   Yes, this wasn't needed for the Python test case I added in this PR, but it was needed for the previous one when the schema is specified. If the schema is not specified, it works fine even with this change; but when it is specified, I get `null` values in the `exp_meta` column.



[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099817496


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   > The issue is with the existing test case. When the schema is provided that case failed
   
   Can you show the error that you got for this test before you made the changes to `_ensure_partitioning`?
   
   (and to be clear, the test also fails without the modifications to the test code? As those changes are only stylistic?)



[GitHub] [arrow] amol- commented on pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #33969:
URL: https://github.com/apache/arrow/pull/33969#issuecomment-1490642953

   Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍

