You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "mpimenov (via GitHub)" <gi...@apache.org> on 2024/02/13 18:54:28 UTC

[I] [C++] Possible data race when reading metadata of a parquet file [arrow]

mpimenov opened a new issue, #40068:
URL: https://github.com/apache/arrow/issues/40068

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The first and the last line of this block of code access the same ```metadata``` variable but only one of them does so holding a lock.
   I assume this means the other one should too.
   There are some other places in this file that access metadata in tricky ways (e.g. it is not clear from a first glance at a method whether nullptr is allowed or not). They could also race.
   https://github.com/apache/arrow/blob/0dbbd43ca9133912d1809394727784560cc5e797/cpp/src/arrow/dataset/file_parquet.cc#L607-L618
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1943479691

   > The first and the last line of this block of code access the same `metadata` variable but only one of them does so holding a lock.
   
   1) Which metadata variable?
   2) Which lock are you talking about?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "mpimenov (via GitHub)" <gi...@apache.org>.
mpimenov commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1944213925

   Arrow is built from commit 3c655df6d459f50cc9734b4f4da31c84ece6c030
   [tsan-40068.txt](https://github.com/apache/arrow/files/14283787/tsan-40068.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1943532750

   Hmm, then please post a corresponding TSAN report, this would help understand the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1943498089

   Well, I'm not sure why it would be a problem in the snippet above. `metadata()` in line 607 returns before `EnsureCompleteMetadata()` in line 618 executes.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "mpimenov (via GitHub)" <gi...@apache.org>.
mpimenov commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1943530474

   In this call, sure. But by the same reasoning the locking is not needed at all.
   I assume (and I've observed it in TSAN reports) that this method can be called from multiple threads, with the intention of caching metadata by the first thread that arrives to read it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "mpimenov (via GitHub)" <gi...@apache.org>.
mpimenov commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1944285079

   Dataset creation itself looks quite vanilla to me
   ```
     std::vector<std::string> paths = FillPaths();
     arrow::dataset::FileSystemFactoryOptions options;
     options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();
     options.partition_base_dir = root_path;
     options.exclude_invalid_files = true;
     return arrow::dataset::FileSystemDatasetFactory::Make(
                std::make_shared<arrow::fs::LocalFileSystem>(),
                paths,
                std::make_shared<arrow::dataset::ParquetFileFormat>(),
                options
     )
         .ValueOrDie()
         ->Finish()
         .ValueOrDie();
   ```
   
   However, when reading it I do this (for filtering columns and such)
   
   ```
     arrow::dataset::internal::Initialize();
     std::vector<std::shared_ptr<arrow::Field>> fields;
     for (const auto& i : dataset->schema()->fields()) {
       if (options.fields_.empty() || contains(options.fields_, i->name())) {
         fields.push_back(i);
       }
     }
     dataset = dataset->ReplaceSchema(std::make_shared<arrow::Schema>(fields, dataset->schema()->metadata())).ValueOrDie();
   ```
   I don't know if this is idiomatic enough


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1944297766

   That looks reasonable to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1944283123

   In any case, it would probably be good for `metadata()` to take the lock before reading its result value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #40068: [C++] Possible data race when reading metadata of a parquet file
URL: https://github.com/apache/arrow/issues/40068


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "mpimenov (via GitHub)" <gi...@apache.org>.
mpimenov commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1943482980

   1. https://github.com/apache/arrow/blob/0dbbd43ca9133912d1809394727784560cc5e797/cpp/src/arrow/dataset/file_parquet.h#L221
   2. https://github.com/apache/arrow/blob/0dbbd43ca9133912d1809394727784560cc5e797/cpp/src/arrow/dataset/file_parquet.cc#L783


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1944256761

   `ScanBatchesAsync` is only called from `FragmentToBatches`, which in turn can only be called from a "scan" node.
   
   One possible explanation would be if `Dataset::GetFragments` returns the same fragment instance twice. How did you create your dataset?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Possible data race when reading metadata of a parquet file [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #40068:
URL: https://github.com/apache/arrow/issues/40068#issuecomment-1944240567

   Thanks! So, trying to sum it up:
   * main thread has an Acero source node whose `StartProducing` calls `FragmentsToBatches` calls `ParquetFileFormat::ScanBatchesAsync`
   ```
       #2 arrow::dataset::ParquetFileFormat::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&, std::shared_ptr<arrow::dataset::FileFragment> const&) const $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:607:36 (arrow_reproduce+0x126e20b)
       #3 arrow::dataset::FileFragment::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&) $ARROW_ROOT/cpp/src/arrow/dataset/file_base.cc:188:19 (arrow_reproduce+0x1156715)
       #4 arrow::dataset::(anonymous namespace)::FragmentToBatches(arrow::Enumerated<std::shared_ptr<arrow::dataset::Fragment>> const&, std::shared_ptr<arrow::dataset::ScanOptions> const&) $ARROW_ROOT/cpp/src/arrow/dataset/scanner.cc:305:3 (arrow_reproduce+0x11c4fa1)
       #5 arrow::dataset::(anonymous namespace)::FragmentsToBatches(std::function<arrow::Future<std::shared_ptr<arrow::dataset::Fragment>> ()>, std::shared_ptr<arrow::dataset::ScanOptions> const&)::$_0::operator()(arrow::Enumerated<std::shared_ptr<arrow::dataset::Fragment>> const&) const $ARROW_ROOT/cpp/src/arrow/dataset/scanner.cc:333:36 (arrow_reproduce+0x11c4fa1)
   ```
   
   * Arrow worker thread executes a Future callback from `ParquetFileReader::OpenAsync`'s own callback, that in turn calls `ParquetFileFormat::ScanBatchesAsync`
   ```
       #4 arrow::dataset::ParquetFileFragment::SetMetadata(std::shared_ptr<parquet::FileMetaData>, std::shared_ptr<parquet::arrow::SchemaManifest>) $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:819:13 (arrow_reproduce+0x12764c2)
       #5 arrow::dataset::ParquetFileFragment::EnsureCompleteMetadata(parquet::arrow::FileReader*) $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:811:10 (arrow_reproduce+0x1275b25)
       #6 arrow::dataset::ParquetFileFormat::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&, std::shared_ptr<arrow::dataset::FileFragment> const&) const::$_0::operator()(std::shared_ptr<parquet::arrow::FileReader> const&) $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:618:5 (arrow_reproduce+0x1283f68)
   ```
   
   What seems weird is that two `ScanBatchesAsync` calls would happen on the _same_ fragment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org