You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/04/14 20:03:22 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

wjones127 opened a new pull request, #35146:
URL: https://github.com/apache/arrow/pull/35146

   ### Rationale for this change
   
   This has been a long-standing TODO.
   
   ### What changes are included in this PR?
   
   
   
   ### Are these changes tested?
   
   
   ### Are there any user-facing changes?
   
   **This PR includes breaking changes to public APIs.**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1546786132

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/81d5e0207fb64cbd97aa10e18807feaf...74b1828557fe4607b5ed65b10eaac570/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35146:
URL: https://github.com/apache/arrow/pull/35146#discussion_r1191488592


##########
cpp/src/parquet/properties.h:
##########
@@ -938,13 +937,13 @@ class PARQUET_EXPORT ArrowWriterProperties {
     /// list types (default "item"), will use "entries", as is specified in

Review Comment:
   Oops.
   
   ```c++
   MapType::MapType(std::shared_ptr<DataType> key_type, std::shared_ptr<Field> item_field,
                    bool keys_sorted)
       : MapType(::arrow::field("key", std::move(key_type), false), std::move(item_field),
                 keys_sorted) {}
   
   MapType::MapType(std::shared_ptr<Field> key_field, std::shared_ptr<Field> item_field,
                    bool keys_sorted)
       : MapType(
             ::arrow::field("entries",
                            struct_({std::move(key_field), std::move(item_field)}), false),
             keys_sorted) {}
   ```
   
   Found `"entries"` in `arrow/type.cc`, and spec is using "element". 
   
   And here maybe we're using "entries":
   
   ```c++
   /// In a field with Map type, the field has a child Struct field, which then
   /// has two children: key type and the second the value type. The names of the
   /// child fields may be respectively "entries", "key", and "value", but this is
   /// not enforced.
   ///
   /// Map
   /// ```text
   ///   - child[0] entries: Struct
   ///     - child[0] key: K
   ///     - child[1] value: V
   /// ```
   /// Neither the "entries" field nor the "key" field may be nullable.
   ```
   
   According to spec. Map should use "key_value". Seems it's incorrect? Am I right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35146:
URL: https://github.com/apache/arrow/pull/35146#discussion_r1191447731


##########
cpp/src/parquet/properties.h:
##########
@@ -938,13 +937,13 @@ class PARQUET_EXPORT ArrowWriterProperties {
     /// list types (default "item"), will use "entries", as is specified in

Review Comment:
   It seems this docstring is incorrect as per the Parquet spec, or am I reading it wrong?
   ```suggestion
       /// list types (default "item"), will use "element", as is specified in
   ```
   
   See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1514998790

   In C++, code shows `force`. But when I go through code in arrow-rs, it doesn't say is force. Maybe we should ask the parquet maillist later?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1515178867

   > In C++, code shows `force`. But when I go through code in arrow-rs, it doesn't say is force. And though parquet-format use "element" in description, it doesn't says it's forced. Maybe we should ask the parquet maillist later?
   
   The rules were laid out earlier here:
   
   https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
   
   Before this PR I did work to make sure that the Arrow C++ implementation (1) didn't care about the field names in equality comparison ([PR](https://github.com/apache/arrow/pull/14847)) and (2) could cheaply cast between types that differed only in field names ([PR](https://github.com/apache/arrow/pull/14198)).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1534916578

   Is it time to revive this PR and check it in? @wjones127 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35146:
URL: https://github.com/apache/arrow/pull/35146


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1546785661

   Benchmark runs are scheduled for baseline = 1624d5aaf4f524487079066dc730176d82b986f5 and contender = e324f9ab8aaa7f51d7a14d4d3f1a9713b866196b. e324f9ab8aaa7f51d7a14d4d3f1a9713b866196b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/890d5d4bf6804b6faa32f8f7d8b99d82...cca66b8044824339bdcb8cd456616348/)
   [Failed :arrow_down:1.66% :arrow_up:0.06%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1dd57bc8bbec468c871bc894b7cc7b8f...44820e1b277642f6a11788a30317c2a5/)
   [Finished :arrow_down:1.52% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/81d5e0207fb64cbd97aa10e18807feaf...74b1828557fe4607b5ed65b10eaac570/)
   [Finished :arrow_down:0.18% :arrow_up:0.06%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6ddb7b6bcf014e6db1fa98fab53464ad...72e6f533a14347339451132dcf641a8c/)
   Buildkite builds:
   [Finished] [`e324f9ab` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2869)
   [Failed] [`e324f9ab` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2905)
   [Finished] [`e324f9ab` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2870)
   [Finished] [`e324f9ab` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2895)
   [Finished] [`1624d5aa` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2868)
   [Finished] [`1624d5aa` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2904)
   [Finished] [`1624d5aa` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2869)
   [Finished] [`1624d5aa` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2894)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35146:
URL: https://github.com/apache/arrow/pull/35146#discussion_r1191710190


##########
cpp/src/parquet/properties.h:
##########
@@ -938,13 +937,13 @@ class PARQUET_EXPORT ArrowWriterProperties {
     /// list types (default "item"), will use "entries", as is specified in

Review Comment:
   You are right, that should be "element".
   
   Parquet will always have them as "element". Arrow has it's own spec, but isn't prescriptive for the name of the list type. Either way, I don't think we should care too much about this detail; these field names should be pretty much invisible to users.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1509171588

   * Closes: #29781


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35146:
URL: https://github.com/apache/arrow/pull/35146#discussion_r1168121298


##########
cpp/src/parquet/properties.h:
##########
@@ -819,8 +819,7 @@ class PARQUET_EXPORT ArrowWriterProperties {
           coerce_timestamps_unit_(::arrow::TimeUnit::SECOND),
           truncated_timestamps_allowed_(false),
           store_schema_(false),
-          // TODO: At some point we should flip this.

Review Comment:
   The line 320 in the parquet.rst doc should also be changed.
   ```
      std::shared_ptr<ArrowWriterProperties> arrow_props = ArrowWriterProperties::Builder()
         .enable_deprecated_int96_timestamps() // default False
         ->store_schema() // default False
         ->enable_compliant_nested_types() // default False
         ->build();
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35146:
URL: https://github.com/apache/arrow/pull/35146#discussion_r1167964223


##########
cpp/src/parquet/properties.h:
##########
@@ -819,8 +819,7 @@ class PARQUET_EXPORT ArrowWriterProperties {
           coerce_timestamps_unit_(::arrow::TimeUnit::SECOND),
           truncated_timestamps_allowed_(false),
           store_schema_(false),
-          // TODO: At some point we should flip this.

Review Comment:
   ```
   /// This is disabled by default, but will be enabled by default in future.
   ```
   
   Should we change this in Loc 880?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35146: GH-29781: [C++][Parquet] Switch to use compliant nested types by default

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35146:
URL: https://github.com/apache/arrow/pull/35146#issuecomment-1515630445

   Thanks! So, this flag could make converting and reading a bit cheaper?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org