You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/23 20:18:20 UTC

[GitHub] [arrow] bkietz opened a new pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

bkietz opened a new pull request #7526:
URL: https://github.com/apache/arrow/pull/7526


   The physical schema is required to validate predicates used for filtering row groups based on statistics.
   
   It can also be explicitly provided to ensure that if no row groups satisfy the predicate no I/O is done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444914459



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<const ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
       Fragments created by ParquetDatasetFactory will have their physical schema populated on construction: https://github.com/apache/arrow/pull/7526/files#diff-b7aed3b5c0adacff778c6930cc6a03e4R646-R650
   
   After that calling `ReadPhysicalSchema` on those fragments will be an IO-free operation https://github.com/apache/arrow/pull/7526/files#diff-ce896c4eb96f3a050aec23151adf58e8R42-R48




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-648401641


   https://issues.apache.org/jira/browse/ARROW-9146


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques closed pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
fsaintjacques closed pull request #7526:
URL: https://github.com/apache/arrow/pull/7526


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445512267



##########
File path: cpp/src/arrow/util/mutex.h
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
       We cannot use `std::mutex` in public headers due to CPP/CLI forbidding use of that header https://docs.microsoft.com/en-us/cpp/standard-library/mutex#remarks
   We don't support compilation of Arrow with the `/clr` option so it's acceptable to use `std::mutex` in source files and private headers (including `arrow/util/mutex.cc`). However arrow headers may be included by projects which *are* using `/clr` and usage of `std::mutex` in those public header files would break those projects.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444492049



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<const ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
       Correct, this line can't be here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444914459



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<const ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
       https://github.com/apache/arrow/pull/7526/files#diff-b7aed3b5c0adacff778c6930cc6a03e4R646-R650




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445482607



##########
File path: cpp/src/arrow/util/mutex.h
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
       we use `std:mutex` in several places in the code base, so this comment is not fully clear to me

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<const ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
       To avoid confusion about this, should we rename ReadPhysicalSchema? Or instead of having `ReadPhysicalSchema` which caches and `ReadPhysicalSchemaImpl` that always reads, a `GetPhysicalSchema` that caches and `ReadPhysicalSchema` that reads?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600179


   Correct, we don't support the file changing underneath. For now we can preach YAGNI on this as it is a rare use case. If this is required, the consumer can create a custom dataset/fragment.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649506352


   > I suppose we don't / don't want to support files being changed after the dataset object has been constructed anyways?
   
   No, we don't support this at all. IMHO if fragments are modified that calls for the viewing `Dataset` to be reconstructed. @fsaintjacques 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445516148



##########
File path: cpp/src/arrow/util/mutex.h
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
       Thanks for the explanation! maybe add: "can't use it directly *in public headers*"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445096637



##########
File path: cpp/src/arrow/util/mutex.h
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
       We use it everywhere why is it a problem here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444509707



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<const ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
       Fragments created using `_metadata` have their physical schema preloaded so they won't incur IO when no row group can satisfy a predicate




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-650242207


   This has a python lint failure: https://github.com/apache/arrow/pull/7526/checks?check_run_id=811470274#step:7:1377


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-650289073


   I'll merge after Appveyor passes, ignoring Travis


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445583091



##########
File path: cpp/src/arrow/util/mutex.h
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
       https://github.com/apache/arrow/pull/7526#discussion_r445512267




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600834


   The test is failing on windows. https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33736283/job/mshkl837y3b5v5u6


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444486902



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<const ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
       Didn't look in detail yet (so might be misreading, it's late here ;)), but a quick note: in general, when Fragments were instantiated with ParquetFactory (from a `_metadata` file) with populated statistics, I think it should be possible to filter out fragments without any additional IO.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org