Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/01 21:40:07 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #2677: Switch to object_store crate (#2489)

tustvold opened a new pull request, #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677

   _Draft as needs a bit more polish and will benefit from some planned upstream changes to arrow-rs and object_store_
   
   # Which issue does this PR close?
   
   Closes #2489
   
   # Rationale for this change
   
   See ticket
   
   # What changes are included in this PR?
   
   Switches DataFusion to using the object_store crate in place of its own abstraction in `datafusion/data-access`
   
   # Are there any user-facing changes?
   
   Yes, this moves to using the object_store crate.
   
   # Does this PR break compatibility with Ballista?
   
   Possibly
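   
   For a sense of the API being switched to, below is a minimal sketch of reading a byte range through the object_store crate. `LocalFileSystem`, `Path`, and `ObjectStore::get_range` are real items in the crate, though exact signatures may differ between versions, and the file path is illustrative:
   
   ```rust
   use object_store::{local::LocalFileSystem, path::Path, ObjectStore};
   
   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Every backend (local filesystem, in-memory, S3, ...) implements the
       // same ObjectStore trait
       let store: Box<dyn ObjectStore> =
           Box::new(LocalFileSystem::new_with_prefix(".")?);
       let path = Path::from("data/example.parquet");
   
       // Byte-range reads are the primitive that lets readers fetch only the
       // parts of a file they need, e.g. a parquet footer or column chunk
       let bytes = store.get_range(&path, 0..1024).await?;
       println!("read {} bytes", bytes.len());
       Ok(())
   }
   ```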




[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170101169

   @tustvold -- given the work for #2226, is the eventual plan to interleave IO and CPU decoding?
   
   I wonder if we can find some workaround so that @Ted-Jiang and his team don't lose performance while we continue to make progress (e.g. could we fetch to a local file, or put in some hack for people who want to decode using a blocking IO thread, or something?)




[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2677: Switch to object_store crate (#2489)

tustvold commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r892627505


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -210,52 +202,20 @@ impl ExecutionPlan for ParquetExec {
             Some(proj) => proj,
             None => (0..self.base_config.file_schema.fields().len()).collect(),
         };
-        let partition_col_proj = PartitionColumnProjector::new(
-            Arc::clone(&self.projected_schema),
-            &self.base_config.table_partition_cols,
-        );
 
-        let object_store = context
-            .runtime_env()
-            .object_store(&self.base_config.object_store_url)?;
-
-        let stream = ParquetExecStream {
-            error: false,
+        let opener = ParquetOpener {
             partition_index,
-            metrics: self.metrics.clone(),
-            object_store,
-            pruning_predicate: self.pruning_predicate.clone(),
+            projection: Arc::from(projection),
             batch_size: context.session_config().batch_size,
-            schema: self.projected_schema.clone(),
-            projection,
-            remaining_rows: self.base_config.limit,
-            reader: None,
-            files: self.base_config.file_groups[partition_index].clone().into(),
-            projector: partition_col_proj,
-            adapter: SchemaAdapter::new(self.base_config.file_schema.clone()),
-            baseline_metrics: BaselineMetrics::new(&self.metrics, partition_index),
+            pruning_predicate: self.pruning_predicate.clone(),
+            table_schema: self.base_config.file_schema.clone(),
+            metrics: self.metrics.clone(),
         };
 
-        // Use spawn_blocking only if running from a tokio context (#2201)
-        match tokio::runtime::Handle::try_current() {
-            Ok(handle) => {
-                let (response_tx, response_rx) = tokio::sync::mpsc::channel(2);
-                let schema = stream.schema();
-                let join_handle = handle.spawn_blocking(move || {
-                    for result in stream {
-                        if response_tx.blocking_send(result).is_err() {
-                            break;
-                        }
-                    }
-                });
-                Ok(RecordBatchReceiverStream::create(
-                    &schema,
-                    response_rx,
-                    join_handle,
-                ))
-            }
-            Err(_) => Ok(Box::pin(stream)),
-        }
+        let stream =
+            FileStream::new(&self.base_config, partition_index, context, opener)?;

Review Comment:
   Parquet now uses the same FileStream interface as the other formats, which reduces code duplication
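   
   For reference, a minimal standalone sketch of the spawn_blocking bridge that the removed code used (the item type and values are illustrative, not DataFusion's):
   
   ```rust
   use tokio::sync::mpsc;
   
   #[tokio::main]
   async fn main() {
       // Bounded channel provides backpressure between producer and consumer
       let (tx, mut rx) = mpsc::channel::<i32>(2);
   
       // Drive a blocking iterator on tokio's blocking thread pool, exactly as
       // the removed ParquetExecStream bridge did
       let join = tokio::task::spawn_blocking(move || {
           for item in 0..5 {
               // blocking_send is the sync counterpart of send, usable from
               // non-async code; it parks this thread if the channel is full
               if tx.blocking_send(item).is_err() {
                   break; // receiver dropped, stop producing
               }
           }
       });
   
       while let Some(item) = rx.recv().await {
           println!("got {item}");
       }
       join.await.unwrap();
   }
   ```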





[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170942273

   > I am not clear about the `whereas master interleaves the IO and decoding`; I think master uses blocking IO, so decode must wait for IO, while this patch uses interleaving with async functions to reduce blocked IO.
   
   Master interleaves IO at the page level, reading individual pages as required and blocking the calling thread as it does so. This branch instead performs async IO, fetching column chunks into memory without blocking threads. This is significantly better for object stores, but will perform "worse" for certain workloads accessing local files, where the approach on master may be faster, at the obvious cost of blocking threads.
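   
   As a rough illustration of the two strategies (offsets here are made up; real code derives page and column chunk locations from the parquet metadata, and the object_store calls assume the crate roughly as donated):
   
   ```rust
   use object_store::{memory::InMemory, path::Path, ObjectStore};
   
   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       let store = InMemory::new();
       let path = Path::from("example.parquet");
       store.put(&path, vec![0u8; 4096].into()).await?;
   
       // This branch: one async range request buffers the whole column chunk
       // before decode starts; no thread is blocked while the bytes arrive
       let _chunk = store.get_range(&path, 0..4096).await?;
   
       // Master, conceptually: one read per page as the decoder needs it.
       // Against an object store each of these would be a separate request,
       // issued while blocking the calling thread
       for page in [0..1024, 1024..2048, 2048..3072, 3072..4096] {
           let _page = store.get_range(&path, page).await?;
       }
       Ok(())
   }
   ```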
   
   > if we first integrated the object store abstraction into the repository.
   
   I would be fine waiting until the donation to arrow-rs goes through (https://github.com/influxdata/object_store_rs/issues/41), but given that this intent has been clearly broadcast, I had hoped we could just get this in rather than waiting the 3 or so weeks that process will take. What do you think?




[GitHub] [arrow-datafusion] rdettai commented on pull request #2677: Switch to object_store crate (#2489)

rdettai commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170848585

   > That being said, I'm not really sure I agree that the object store abstraction is all that core to DataFusion. It is just an IO abstraction used at the edges of plans
   
   That's quite a lot of files that got modified for switching an "IO abstraction used at the edges of plans" πŸ˜„. I also believe that reading data in from files is _crucial_ to an analytics query engine. Indeed, it isn't _core_ in the sense that you can do things with your engine without it (reading in-memory or streaming data...), but it is still one of its main use cases and, more importantly, a critical performance bottleneck. And as always with optimization, you sometimes need to bend the separation of concerns a bit to reach your goal, which means you will need to tweak the abstraction to get the performance you want (as you can see with topics like prefetch strategies...). And this is made more complicated if we depend on an external store that is not owned by us.
   
   TL;DR: I would also be more comfortable with this change if we first integrated the object store abstraction into the repository.




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1152740006

   > I think this is precisely what @tustvold is working towards
   
   Indeed, #2711 adds buffered prefetch of projected columns and non-pruned row groups, using the functionality added in https://github.com/apache/arrow-rs/pull/1803. Further, with the work of @Ted-Jiang on ColumnIndex support (https://github.com/apache/arrow-rs/issues/1705), we may in the not too distant future support page-level pushdown :tada:
   
   > will be out next week so waiting for his return is probably the wisest course of action
   
   I am out for the next week and a bit, and am not sure how much time I will have to work on this, but please do leave feedback and I'll get to it on my return :smile: 




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170267958

   So here is where we stand with regards to this PR:
   
   ## Pros
   
   * Fewer range requests will be made to object storage, reducing latency and monetary costs
   * Threads will not be blocked on network IO
   * Does not make use of futures::block_on or tokio::spawn_blocking
   * Will integrate well with future work to reduce bytes fetched from object storage  - https://github.com/apache/arrow-rs/issues/1705
   * Fits with the longer-term vision of morsel-driven IO within DataFusion - #2504 
   
   ## Cons
   
   * Slightly higher memory usage for some queries, as it buffers encoded column chunks instead of reading pages on demand
   * Queries to **local** files with column chunks containing large numbers of pages may be slower
   
   ## Conclusion
   
   I therefore think that, on balance, this PR represents a step forward, with the only regression being one that can be mitigated by using smaller row groups.
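   
   For anyone wanting to apply that mitigation, a hedged sketch of writing parquet with smaller row groups via parquet-rs (`set_max_row_group_size` is a real WriterProperties option; the row count, column, and file name are illustrative):
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;
   use arrow::array::Int64Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::arrow_writer::ArrowWriter;
   use parquet::file::properties::WriterProperties;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int64Array::from_iter_values(0..1_000_000))],
       )?;
   
       // Smaller row groups mean smaller column chunks, bounding how much the
       // async reader buffers per fetch
       let props = WriterProperties::builder()
           .set_max_row_group_size(64 * 1024) // rows per row group
           .build();
   
       let file = File::create("small_row_groups.parquet")?;
       let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
       writer.write(&batch)?;
       writer.close()?;
       Ok(())
   }
   ```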





[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1152577298

   > I wonder if there is more to gain (in a future iteration of course :)) by reading the metadata and then doing buffered prefetch of only the projected columns and non-pruned row groups. If we can also crack the metadata caching then this should be a pure win.
   
   I think this is precisely what @tustvold  is working towards -- I am not sure we have a unified vision writeup / ticket anywhere but we are working on one...




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1167298760

   > do you have an ETA on this PR being ready to merge?
   
   I personally think it is pretty much ready to go. There are improvements to be made for sure, but I would rather address these as follow-ups to keep the diff more manageable




[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2677: Switch to object_store crate (#2489)

tustvold commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r907326928


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -560,8 +534,8 @@ mod tests {
         let _ = collect(exec.clone(), task_ctx.clone()).await?;
         let _ = collect(exec_projected.clone(), task_ctx).await?;
 
-        assert_bytes_scanned(exec, 2522);
-        assert_bytes_scanned(exec_projected, 1924);
+        assert_bytes_scanned(exec, 1851);

Review Comment:
   Because this PR no longer performs selection pushdown, and simply pulls the entire file.





[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170271295

   Given the tradeoffs articulated by @tustvold  in https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170267958 I think we should merge this PR.
   
   @Ted-Jiang what do you think?
   
   cc @thinkharderdev @matthewmturner @andygrove @liukun4515 @yjshen @wjones127 @houqp -- any other thoughts / concerns before doing so? It will cause churn downstream, but we have known that ever since https://github.com/apache/arrow-datafusion/issues/2489 was proposed




[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #2677: Switch to object_store crate (#2489)

Ted-Jiang commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170705539

   > Finally got profiles (by switching the VM to fedora), and it certainly fits with my hypothesis above
   > 
   > ![image](https://user-images.githubusercontent.com/1781103/176465394-b57873aa-6e4c-410b-b202-de43e0b13bff.png)
   > 
   > On the left we have master, and on the right this branch. The CPU activity under `parquet_query_s` demarcates each benchmark iteration; within this you have two row groups being read. We can clearly see that with this PR there is a noticeable delay as it fetches the bytes into memory before it starts decoding the data, whereas master interleaves the IO and decoding. There is a trade-off here: the approach on master is faster for this particular benchmark, but comes at the cost of stalling out worker threads on IO that could have been doing other work during decode.
   > 
   > There are some ways we could potentially improve this, e.g. interleaving IO at the page level instead of the column chunk level, but this is unlikely to help with object storage and may actually perform worse. I'm not sure if this is something worth optimising, but would appreciate other people's thoughts.
   
   @tustvold Thanks a lot for sharing πŸ‘.
   I am not clear about the `whereas master interleaves the IO and decoding`; I think master uses blocking IO, so decode must wait for IO, while this patch uses interleaving with async functions to reduce blocked IO.
   If I missed something, please tell me πŸ˜‚




[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1151065436

   > My personal preference would be for 9.0.0 to include the switch so we can start to bring the ecosystem along, but I'm not sure if the timings will work out for that and I don't feel especially strongly. @alamb probably has a view on this.
   
   I also recommend waiting until after the 9.0.0 release. Rationale:
   1. The datafusion releases are already a substantial effort, so anything we can do to reduce the potential issues requiring a second one is good
   2. I think given the wide ranging implications of the PR up/down the ecosystem, we should have some additional reviewers on it prior to merging
   3. I believe @tustvold  will be out next week so waiting for his return is probably the wisest course of action




[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #2677: Switch to object_store crate (#2489)

Ted-Jiang commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1169493419

   > Benchmarking tokio: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required = 'p...: Warming up for 3.0000 s
   > Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.0s, or reduce sample count to 40.
   > tokio: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required = 'p...
   >                         time:   [119.38 ms 119.57 ms 119.78 ms]
   >                         change: [+15.290% +15.988% +16.582%] (p = 0.00 < 0.05)
   >                         Performance has regressed.
   > Found 5 outliers among 100 measurements (5.00%)
   >   5 (5.00%) high mild
   >
   > Benchmarking scheduled: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required ...: Warming up for 3.0000 s
   > Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.4s, or reduce sample count to 40.
   > scheduled: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required ...
   >                         time:   [123.68 ms 123.88 ms 124.11 ms]
   >                         change: [+11.570% +12.168% +12.748%] (p = 0.00 < 0.05)
   >                         Performance has regressed.
   > Found 11 outliers among 100 measurements (11.00%)
   >   6 (6.00%) high mild
   
   πŸ€” Why do these cases have such a regression?




[GitHub] [arrow-datafusion] rdettai commented on pull request #2677: Switch to object_store crate (#2489)

rdettai commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1172018118

   Great! I missed that the donation was in progress. Obviously no need to wait then πŸ˜‰ 




[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1172882032

   I took the liberty of merging up from master to resolve some conflicts in `Cargo.toml` 




[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #2677: Switch to object_store crate (#2489)

Ted-Jiang commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1168455127

   It seems using the IOx `ObjectStore` will only support `async` readers
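   
   For context, a minimal sketch of what that means in practice (assuming the crate roughly as donated; exact signatures may vary): every read is async end to end, so callers need an async runtime such as tokio rather than a blocking `Read`:
   
   ```rust
   use object_store::{memory::InMemory, path::Path, ObjectStore};
   
   #[tokio::main]
   async fn main() -> object_store::Result<()> {
       let store = InMemory::new();
       let path = Path::from("file.bin");
       store.put(&path, vec![1u8, 2, 3].into()).await?;
   
       // There is no synchronous entry point: fetching data is async end to end
       let data = store.get(&path).await?.bytes().await?;
       assert_eq!(&data[..], &[1u8, 2, 3][..]);
       Ok(())
   }
   ```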




[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1167893169

   Inspired by @jorgecarleitao 's comment on https://github.com/apache/arrow-datafusion/issues/2709#issuecomment-1167334326
   
   Given how core the object store abstraction is to datafusion, and thus how core https://crates.io/crates/object_store would be if this PR is merged, can we clarify / change how this object store crate is governed? Are we planning to donate it to Apache Arrow (to be included in the `arrow-datafusion` repo)?
   
   I realize that from an initial development perspective, having it in a separate crate may be easier to iterate on, but it is probably worth considering the longer term at this juncture.
   
   




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1167926271

   > Are we planning to donate it to apache arrow (to be included in the arrow-datafusion repo)?
   
   I personally would be fine with it being part of arrow-datafusion or arrow-rs, or some other Apache project. It is mainly a separate project currently because that was what was simplest :shrug: Given that C++ Arrow has a filesystem abstraction, I think providing an object_store abstraction wouldn't be all that rogue.
   
   That being said, I'm not really sure I agree that the object store abstraction is all that core to DataFusion. It is just an IO abstraction used at the edges of plans; it isn't like `futures` or `arrow`, which are fundamental to any use-case. But again, I don't really feel all that strongly :smile:




[GitHub] [arrow-datafusion] andygrove commented on pull request #2677: Switch to object_store crate (#2489)

andygrove commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170773379

   I haven't reviewed the changes here yet but I have no objection to this being merged if the community supports it.




[GitHub] [arrow-datafusion] rdettai commented on pull request #2677: Switch to object_store crate (#2489)

rdettai commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170856975

   I would be interested in @wesm's point of view on this governance question. Just to recap:
   -> we are about to replace the file system abstraction (that we call object store here, https://github.com/apache/arrow-datafusion/tree/master/datafusion/data-access) with an external one that is currently owned by InfluxData (https://github.com/influxdata/object_store_rs/blob/main/src/lib.rs). There are some concerns about whether this is a wise decision or not.




[GitHub] [arrow-datafusion] andygrove commented on pull request #2677: Switch to object_store crate (#2489)

andygrove commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1150195658

   @tustvold I am planning on creating the 9.0.0 RC on Friday. Would we want to hold off merging this until after the 9.0.0 release?




[GitHub] [arrow-datafusion] matthewmturner commented on pull request #2677: Switch to object_store crate (#2489)

matthewmturner commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170495447

   +1 for me.  I will add a note to the README of the s3 object store repo to let users know of the new crate.




[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #2677: Switch to object_store crate (#2489)

Ted-Jiang commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1168494311

   @alamb @tustvold  
   After skimming the code, I think the IOx object_store does not support HDFS for now; if we need this feature, we should add support for it in `object_store`, am I right πŸ€”?




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1168860442

   Master uses spawn_blocking, which gives file decode a dedicated thread; this effectively gives it an unfair advantage when running benchmarks on an otherwise idle machine. However, compared against master with spawn_blocking disabled, and with https://github.com/apache/arrow-rs/pull/1956, we get:
   
   ```
   tokio: select dict_10_optional from t                                                                            
                           time:   [8.9050 ms 8.9173 ms 8.9295 ms]
                           change: [-2.7950% -2.6195% -2.4525%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) low mild
     1 (1.00%) high mild
   
   scheduled: select dict_10_optional from t                                                                            
                           time:   [9.9143 ms 9.9508 ms 10.001 ms]
                           change: [-4.0207% -3.6746% -3.2435%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   
   tokio: select dict_100_optional from t                                                                            
                           time:   [9.0259 ms 9.0440 ms 9.0644 ms]
                           change: [-4.3203% -4.0547% -3.7845%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) high mild
     4 (4.00%) high severe
   
   scheduled: select dict_100_optional from t                                                                            
                           time:   [10.147 ms 10.159 ms 10.171 ms]
                           change: [-3.7494% -3.5730% -3.3990%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   tokio: select dict_1000_optional from t                                                                            
                           time:   [9.1436 ms 9.1600 ms 9.1782 ms]
                           change: [-3.3663% -3.0933% -2.8203%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) high mild
     4 (4.00%) high severe
   
   scheduled: select dict_1000_optional from t                                                                            
                           time:   [10.196 ms 10.207 ms 10.220 ms]
                           change: [-3.9349% -3.7611% -3.5989%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     7 (7.00%) high mild
     1 (1.00%) high severe
   
   tokio: select count(*) from t where dict_10_required = 'prefix#0'                                                                            
                           time:   [8.9137 ms 8.9359 ms 8.9579 ms]
                           change: [-19.374% -18.929% -18.489%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
   
   scheduled: select count(*) from t where dict_10_required = 'prefix#0'                                                                            
                           time:   [8.4053 ms 8.4312 ms 8.4574 ms]
                           change: [-1.4720% -1.0417% -0.5966%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   
   tokio: select count(*) from t where dict_100_required = 'prefix#0'                                                                            
                           time:   [7.7844 ms 7.8003 ms 7.8168 ms]
                           change: [-12.353% -11.623% -10.882%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   scheduled: select count(*) from t where dict_100_required = 'prefix#0'                                                                            
                           time:   [7.2560 ms 7.3004 ms 7.3455 ms]
                           change: [+0.6743% +1.2697% +1.9890%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   
   tokio: select count(*) from t where dict_1000_required = 'prefix#0'                                                                            
                           time:   [7.3898 ms 7.4287 ms 7.4677 ms]
                           change: [+0.4728% +1.1875% +1.9094%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) low mild
   
   scheduled: select count(*) from t where dict_1000_required = 'prefix#0'                                                                            
                           time:   [6.6191 ms 6.6347 ms 6.6506 ms]
                           change: [-2.3638% -1.9639% -1.5633%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   
   tokio: select i64_optional from t where dict_10_required = 'prefix#2' and dict_1000_required = 'pref...                                                                            
                           time:   [17.647 ms 17.673 ms 17.701 ms]
                           change: [-4.0966% -3.7503% -3.4007%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     6 (6.00%) high mild
     2 (2.00%) high severe
   
   scheduled: select i64_optional from t where dict_10_required = 'prefix#2' and dict_1000_required = '...                                                                            
                           time:   [17.434 ms 17.460 ms 17.486 ms]
                           change: [-3.2385% -2.9102% -2.5799%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   tokio: select i64_required from t where dict_10_required = 'prefix#2' and dict_1000_required = 'pref...                                                                            
                           time:   [16.511 ms 16.538 ms 16.566 ms]
                           change: [-2.9112% -2.3987% -1.9441%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   scheduled: select i64_required from t where dict_10_required = 'prefix#2' and dict_1000_required = '...                                                                            
                           time:   [16.202 ms 16.237 ms 16.271 ms]
                           change: [-1.3293% -0.9066% -0.5069%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) low mild
   
   Benchmarking tokio: select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'p...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, or reduce sample count to 70.
   tokio: select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'p...                                                                            
                           time:   [70.205 ms 70.289 ms 70.378 ms]
                           change: [-4.6151% -4.2040% -3.7925%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   Benchmarking scheduled: select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required ...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, or reduce sample count to 60.
   scheduled: select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required ...                                                                            
                           time:   [74.526 ms 74.615 ms 74.699 ms]
                           change: [-2.3271% -2.0266% -1.7270%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) low mild
   
   Benchmarking tokio: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required = 'p...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.0s, or reduce sample count to 40.
   tokio: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required = 'p...                                                                            
                           time:   [119.38 ms 119.57 ms 119.78 ms]
                           change: [+15.290% +15.988% +16.582%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     5 (5.00%) high mild
   
   Benchmarking scheduled: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required ...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.4s, or reduce sample count to 40.
   scheduled: select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required ...                                                                            
                           time:   [123.68 ms 123.88 ms 124.11 ms]
                           change: [+11.570% +12.168% +12.748%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 11 outliers among 100 measurements (11.00%)
     6 (6.00%) high mild
     5 (5.00%) high severe
   
   tokio: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_optional...                                                                            
                           time:   [21.048 ms 21.074 ms 21.099 ms]
                           change: [-2.9939% -2.6899% -2.3848%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) low mild
   
   scheduled: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_opti...                                                                            
                           time:   [23.353 ms 23.404 ms 23.457 ms]
                           change: [-2.2225% -1.9064% -1.5766%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   tokio: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_optional... #2                                                                            
                           time:   [21.421 ms 21.462 ms 21.505 ms]
                           change: [-1.7021% -1.3476% -1.0027%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   
   scheduled: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_opti... #2                                                                            
                           time:   [23.415 ms 23.485 ms 23.558 ms]
                           change: [-1.6729% -1.2924% -0.9078%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   tokio: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_required...                                                                            
                           time:   [19.921 ms 19.967 ms 20.014 ms]
                           change: [-1.2083% -0.7998% -0.4072%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   
   scheduled: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_requ...                                                                            
                           time:   [21.143 ms 21.225 ms 21.310 ms]
                           change: [-1.2061% -0.7093% -0.2452%] (p = 0.01 < 0.05)
                           Change within noise threshold.
   
   tokio: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_required... #2                                                                            
                           time:   [20.369 ms 20.433 ms 20.495 ms]
                           change: [+1.8375% +2.2683% +2.7020%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 7 outliers among 100 measurements (7.00%)
     7 (7.00%) low mild
   
   scheduled: select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_requ... #2                                                                            
                           time:   [21.510 ms 21.586 ms 21.663 ms]
                           change: [+1.4549% +1.9070% +2.3468%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   tokio: select dict_10_optional, count(*) from t group by dict_10_optional                                                                            
                           time:   [23.850 ms 23.907 ms 23.963 ms]
                           change: [-13.523% -13.268% -12.998%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   scheduled: select dict_10_optional, count(*) from t group by dict_10_optional                                                                            
                           time:   [23.547 ms 23.598 ms 23.650 ms]
                           change: [-1.3270% -1.0043% -0.6899%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   Benchmarking tokio: select dict_10_optional, dict_100_optional, count(*) from t group by dict_10_optional, dict_1...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, or reduce sample count to 90.
   tokio: select dict_10_optional, dict_100_optional, count(*) from t group by dict_10_optional, dict_1...                                                                            
                           time:   [53.017 ms 53.120 ms 53.226 ms]
                           change: [-16.476% -15.993% -15.511%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   Benchmarking scheduled: select dict_10_optional, dict_100_optional, count(*) from t group by dict_10_optional, di...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, or reduce sample count to 80.
   scheduled: select dict_10_optional, dict_100_optional, count(*) from t group by dict_10_optional, di...                                                                            
                           time:   [56.802 ms 56.953 ms 57.102 ms]
                           change: [-4.2172% -3.8167% -3.4177%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     4 (4.00%) low mild
     3 (3.00%) high mild
   
   Benchmarking tokio: select dict_10_optional, dict_100_optional, MIN(f64_required), MAX(f64_required), AVG(f64_req...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.0s, or reduce sample count to 50.
   tokio: select dict_10_optional, dict_100_optional, MIN(f64_required), MAX(f64_required), AVG(f64_req...                                                                            
                           time:   [90.751 ms 90.944 ms 91.147 ms]
                           change: [-11.288% -10.904% -10.534%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   Benchmarking scheduled: select dict_10_optional, dict_100_optional, MIN(f64_required), MAX(f64_required), AVG(f64...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.4s, or reduce sample count to 50.
   scheduled: select dict_10_optional, dict_100_optional, MIN(f64_required), MAX(f64_required), AVG(f64...                                                                            
                           time:   [93.324 ms 93.439 ms 93.555 ms]
                           change: [-3.1669% -2.9244% -2.6872%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   Benchmarking tokio: select dict_10_optional, dict_100_optional, MIN(f64_optional), MAX(f64_optional), AVG(f64_opt...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.9s, or reduce sample count to 40.
   tokio: select dict_10_optional, dict_100_optional, MIN(f64_optional), MAX(f64_optional), AVG(f64_opt...                                                                            
                           time:   [108.85 ms 108.97 ms 109.09 ms]
                           change: [-14.437% -14.014% -13.595%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   Benchmarking scheduled: select dict_10_optional, dict_100_optional, MIN(f64_optional), MAX(f64_optional), AVG(f64...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.2s, or reduce sample count to 40.
   scheduled: select dict_10_optional, dict_100_optional, MIN(f64_optional), MAX(f64_optional), AVG(f64...                                                                            
                           time:   [112.08 ms 112.19 ms 112.29 ms]
                           change: [-5.8059% -5.3435% -4.8943%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) low mild
   
   Benchmarking tokio: select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_opt...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.2s, or reduce sample count to 40.
   tokio: select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_opt...                                                                            
                           time:   [102.02 ms 102.26 ms 102.62 ms]
                           change: [-9.8760% -9.4604% -9.0112%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   Benchmarking scheduled: select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64...: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.2s, or reduce sample count to 40.
   scheduled: select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64...                                                                            
                           time:   [102.21 ms 102.35 ms 102.50 ms]
                           change: [-1.7080% -1.3686% -1.0517%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   
   Which would suggest performance is comparable :tada:




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170120699

   > is the eventual plan to interleave IO and CPU decoding
   
   Yes, once we properly support reading and writing the column index structures (https://github.com/apache/arrow-rs/issues/1705) we will have sufficient information to interleave IO at the page level. Currently ParquetRecordBatchStream does not have information on where the pages are actually located, which means it cannot interleave IO at a granularity lower than the column chunk. That being said, we could potentially use a heuristic and only fetch the first 1MB or something; I'll experiment :thinking:
   
   Full disclosure: the column in question is somewhat degenerate; it is 106MB over 100x 1MB pages across two row groups. Another obvious way to improve the performance would be to reduce the size of the row groups.




[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1150132924

   I think this is now ready for review. I've created https://github.com/apache/arrow-datafusion/pull/2711 which uses currently unreleased functionality in arrow-rs to perform byte-range fetches to object storage.
   
   This PR does represent a 10-20% performance regression in the parquet SQL benchmarks when operating on local files. This largely results from moving away from spawn_blocking, with the corresponding scheduler implications documented in https://github.com/apache/arrow-rs/issues/1473.
   
   However, I am inclined to think this is fine for a couple of reasons:
   
   * The work on the new scheduler, which is currently blocked by this PR, was specifically created to address this scheduling disparity
   * The difference becomes inconsequential for any non-trivial queries
   * The ongoing work by @Ted-Jiang will help to reduce the IO costs of parquet
   * I think this lays a solid foundation on which we can iterate




[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #2677: Switch to object_store crate (#2489)

thinkharderdev commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r894712676


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -294,37 +287,50 @@ fn summarize_min_max(
     }
 }
 
-/// Read and parse the schema of the Parquet file at location `path`
-fn fetch_schema(store: &dyn ObjectStore, file: &FileMeta) -> Result<Schema> {
-    let object_reader = store.file_reader(file.sized_file.clone())?;
-    let obj_reader = ChunkObjectReader {
-        object_reader,
-        bytes_scanned: None,
-    };
-    let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
-    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
-    let schema = arrow_reader.get_schema()?;
+async fn fetch_metadata(
+    store: &dyn ObjectStore,
+    file: &ObjectMeta,
+) -> Result<ParquetMetaData> {
+    // TODO: Fetching the entire file to get metadata is wasteful

Review Comment:
   I thought in the previous implementation it would only read the footer? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r907325907


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -580,7 +554,7 @@ mod tests {
         let batches = collect(exec, task_ctx).await?;
         assert_eq!(1, batches.len());
         assert_eq!(11, batches[0].num_columns());
-        assert_eq!(8, batches[0].num_rows());
+        assert_eq!(1, batches[0].num_rows());

Review Comment:
   Because we now use FileStream which slices the returned batches based on the provided limit
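   
   A hedged sketch of that truncation (`apply_limit` is an invented name; the real logic lives in FileStream's poll loop):
   
   ```rust
   use arrow::record_batch::RecordBatch;
   
   /// Emit at most `remain` further rows, truncating the batch if needed.
   fn apply_limit(batch: RecordBatch, remain: &mut usize) -> RecordBatch {
       if batch.num_rows() <= *remain {
           *remain -= batch.num_rows();
           batch
       } else {
           // RecordBatch::slice is zero-copy: it offsets the underlying arrays
           let sliced = batch.slice(0, *remain);
           *remain = 0;
           sliced
       }
   }
   ```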



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170089099

   Finally got profiles (by switching the VM to Fedora), and they certainly fit with my hypothesis above
   
   ![image](https://user-images.githubusercontent.com/1781103/176465394-b57873aa-6e4c-410b-b202-de43e0b13bff.png)
   
   On the left we have master, and on the right this branch. The CPU activity under `parquet_query_s` demarcates each benchmark iteration; within each iteration two row groups are read. We can clearly see that with this PR there is a noticeable delay as it fetches the bytes into memory before starting to decode the data, whereas master interleaves the IO and decoding. There is a trade-off here: the approach of master is faster for this particular benchmark, but comes at the cost of stalling out worker threads on IO that could have been doing other work during decode. 
   
   There are some ways we could potentially improve this, e.g. interleaving IO at the page level instead of the column chunk level, but this is unlikely to help with object storage and may actually perform worse. I'm not sure this is something worth optimising, but I would appreciate other people's thoughts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #2677: Switch to object_store crate (#2489)
URL: https://github.com/apache/arrow-datafusion/pull/2677


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1147700957

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2677?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2677](https://codecov.io/gh/apache/arrow-datafusion/pull/2677?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9a5442d) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/bbb674a0bf16fb754afb7b6451d42061f23e96c8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bbb674a) will **decrease** coverage by `0.11%`.
   > The diff coverage is `86.05%`.
   
   > :exclamation: Current head 9a5442d differs from pull request most recent head 6344612. Consider uploading reports for the commit 6344612 to get more accurate results
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2677      +/-   ##
   ==========================================
   - Coverage   84.66%   84.54%   -0.12%     
   ==========================================
     Files         270      270              
     Lines       46919    46930      +11     
   ==========================================
   - Hits        39723    39679      -44     
   - Misses       7196     7251      +55     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Ξ” | |
   |---|---|---|
   | [datafusion/core/src/execution/runtime\_env.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9leGVjdXRpb24vcnVudGltZV9lbnYucnM=) | `75.67% <ΓΈ> (ΓΈ)` | |
   | [...afusion/core/src/physical\_plan/file\_format/avro.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL2ZpbGVfZm9ybWF0L2F2cm8ucnM=) | `0.00% <0.00%> (ΓΈ)` | |
   | [datafusion/core/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL21vZC5ycw==) | `88.00% <ΓΈ> (ΓΈ)` | |
   | [datafusion/core/src/datasource/file\_format/avro.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhc291cmNlL2ZpbGVfZm9ybWF0L2F2cm8ucnM=) | `61.53% <50.00%> (-8.03%)` | :arrow_down: |
   | [datafusion/core/src/datasource/file\_format/json.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhc291cmNlL2ZpbGVfZm9ybWF0L2pzb24ucnM=) | `93.75% <64.28%> (-5.13%)` | :arrow_down: |
   | [datafusion/common/src/error.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb21tb24vc3JjL2Vycm9yLnJz) | `75.29% <66.66%> (-6.99%)` | :arrow_down: |
   | [datafusion/core/src/datasource/listing/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhc291cmNlL2xpc3RpbmcvbW9kLnJz) | `55.55% <66.66%> (+10.10%)` | :arrow_up: |
   | [datafusion/core/tests/path\_partition.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3BhdGhfcGFydGl0aW9uLnJz) | `85.55% <76.00%> (-1.00%)` | :arrow_down: |
   | [...afusion/core/src/physical\_plan/file\_format/json.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL2ZpbGVfZm9ybWF0L2pzb24ucnM=) | `91.06% <77.77%> (-2.13%)` | :arrow_down: |
   | [datafusion/core/src/datasource/file\_format/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhc291cmNlL2ZpbGVfZm9ybWF0L2Nzdi5ycw==) | `98.91% <80.00%> (-1.09%)` | :arrow_down: |
   | ... and [28 more](https://codecov.io/gh/apache/arrow-datafusion/pull/2677/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2677?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2677?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [bbb674a...6344612](https://codecov.io/gh/apache/arrow-datafusion/pull/2677?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r893456167


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -294,37 +287,50 @@ fn summarize_min_max(
     }
 }
 
-/// Read and parse the schema of the Parquet file at location `path`
-fn fetch_schema(store: &dyn ObjectStore, file: &FileMeta) -> Result<Schema> {
-    let object_reader = store.file_reader(file.sized_file.clone())?;
-    let obj_reader = ChunkObjectReader {
-        object_reader,
-        bytes_scanned: None,
-    };
-    let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
-    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
-    let schema = arrow_reader.get_schema()?;
+async fn fetch_metadata(
+    store: &dyn ObjectStore,
+    file: &ObjectMeta,
+) -> Result<ParquetMetaData> {
+    // TODO: Fetching the entire file to get metadata is wasteful

Review Comment:
   To be clear, the entire file is fetched to read metadata before this PR too (so this PR doesn't make it better or worse).
   
   Maybe it is worth adding a link to the work in arrow-rs related to avoiding this
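   
   For reference, a sketch of what footer-only fetching can look like on top of the new interface (illustrative, not the arrow-rs implementation; `metadata_range` is a made-up helper and `file_size` would come from `ObjectMeta::size`):
   
   ```rust
   use std::ops::Range;
   
   use object_store::{path::Path, ObjectStore};
   
   /// Locate the parquet metadata with one 8-byte read instead of fetching
   /// the whole object: the last 8 bytes of a parquet file are a 4-byte
   /// little-endian metadata length followed by the magic bytes "PAR1".
   /// Assumes file_size >= 8.
   async fn metadata_range(
       store: &dyn ObjectStore,
       location: &Path,
       file_size: usize,
   ) -> object_store::Result<Range<usize>> {
       let footer = store.get_range(location, file_size - 8..file_size).await?;
       assert_eq!(&footer[4..], b"PAR1", "not a parquet file");
       let metadata_len = u32::from_le_bytes(footer[..4].try_into().unwrap()) as usize;
       // A second get_range over this range yields the bytes that the
       // parquet crate's metadata decoder can then parse
       Ok(file_size - 8 - metadata_len..file_size - 8)
   }
   ```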



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -294,37 +287,50 @@ fn summarize_min_max(
     }
 }
 
-/// Read and parse the schema of the Parquet file at location `path`
-fn fetch_schema(store: &dyn ObjectStore, file: &FileMeta) -> Result<Schema> {
-    let object_reader = store.file_reader(file.sized_file.clone())?;
-    let obj_reader = ChunkObjectReader {
-        object_reader,
-        bytes_scanned: None,
-    };
-    let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
-    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
-    let schema = arrow_reader.get_schema()?;
+async fn fetch_metadata(
+    store: &dyn ObjectStore,
+    file: &ObjectMeta,
+) -> Result<ParquetMetaData> {
+    // TODO: Fetching the entire file to get metadata is wasteful
+    match store.get(&file.location).await? {
+        GetResult::File(file, _) => {
+            Ok(SerializedFileReader::new(file)?.metadata().clone())
+        }
+        r @ GetResult::Stream(_) => {
+            let data = r.bytes().await?;
+            let cursor = SliceableCursor::new(data.to_vec());
+            Ok(SerializedFileReader::new(cursor)?.metadata().clone())
+        }
+    }
+}
 
+/// Read and parse the schema of the Parquet file at location `path`
+async fn fetch_schema(store: &dyn ObjectStore, file: &ObjectMeta) -> Result<Schema> {
+    let metadata = fetch_metadata(store, file).await?;
+    let file_metadata = metadata.file_metadata();
+    let schema = parquet_to_arrow_schema(
+        file_metadata.schema_descr(),
+        file_metadata.key_value_metadata(),
+    )?;
     Ok(schema)
 }
 
 /// Read and parse the statistics of the Parquet file at location `path`
-fn fetch_statistics(
+async fn fetch_statistics(
     store: &dyn ObjectStore,
     table_schema: SchemaRef,
-    file: &FileMeta,
+    file: &ObjectMeta,
 ) -> Result<Statistics> {
-    let object_reader = store.file_reader(file.sized_file.clone())?;
-    let obj_reader = ChunkObjectReader {
-        object_reader,
-        bytes_scanned: None,
-    };
-    let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
-    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
-    let file_schema = arrow_reader.get_schema()?;
+    let metadata = fetch_metadata(store, file).await?;

Review Comment:
   it is somewhat appalling how many times a file is fetched during planning -- can't wait to have a way to cache the metadata



##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -80,104 +71,146 @@ pub struct FileStream<F: FormatReaderOpener> {
     pc_projector: PartitionColumnProjector,
     /// the store from which to source the files.
     object_store: Arc<dyn ObjectStore>,
+    /// The stream state
+    state: FileStreamState,
 }
 
-impl<F: FormatReaderOpener> FileStream<F> {
+enum FileStreamState {
+    Idle,
+    Open {
+        future: ReaderFuture,
+        partition_values: Vec<ScalarValue>,
+    },
+    Scan {
+        /// Partitioning column values for the current batch_iter
+        partition_values: Vec<ScalarValue>,
+        /// The reader instance
+        reader: BoxStream<'static, ArrowResult<RecordBatch>>,
+    },
+    Error,
+    Limit,
+}
+
+impl<F: FormatReader> FileStream<F> {
     pub fn new(
-        object_store: Arc<dyn ObjectStore>,
-        files: Vec<PartitionedFile>,
+        config: &FileScanConfig,
+        partition: usize,
+        context: Arc<TaskContext>,
         file_reader: F,
-        projected_schema: SchemaRef,
-        limit: Option<usize>,
-        table_partition_cols: Vec<String>,
-    ) -> Self {
+    ) -> Result<Self> {
+        let (projected_schema, _) = config.project();
         let pc_projector = PartitionColumnProjector::new(
-            Arc::clone(&projected_schema),
-            &table_partition_cols,
+            projected_schema.clone(),
+            &config.table_partition_cols,
         );
 
-        Self {
-            file_iter: Box::new(files.into_iter()),
-            batch_iter: Box::new(iter::empty()),
-            partition_values: vec![],
-            remain: limit,
+        let files = config.file_groups[partition].clone();
+
+        let object_store = context
+            .runtime_env()
+            .object_store(&config.object_store_url)?;
+
+        Ok(Self {
+            file_iter: files.into(),
             projected_schema,
+            remain: config.limit,
             file_reader,
             pc_projector,
             object_store,
-        }
+            state: FileStreamState::Idle,
+        })
     }
 
-    /// Acts as a flat_map of record batches over files. Adds the partitioning
-    /// Columns to the returned record batches.
-    fn next_batch(&mut self) -> Option<ArrowResult<RecordBatch>> {
-        match self.batch_iter.next() {
-            Some(Ok(batch)) => {
-                Some(self.pc_projector.project(batch, &self.partition_values))
-            }
-            Some(Err(e)) => Some(Err(e)),
-            None => match self.file_iter.next() {
-                Some(f) => {
-                    self.partition_values = f.partition_values;
-                    self.object_store
-                        .file_reader(f.file_meta.sized_file)
-                        .and_then(|r| r.sync_reader())
-                        .map_err(|e| ArrowError::ExternalError(Box::new(e)))
-                        .and_then(|f| {
-                            self.batch_iter = (self.file_reader)(f, &self.remain);
-                            self.next_batch().transpose()
-                        })
-                        .transpose()
+    fn poll_inner(

Review Comment:
   cc @rdettai 



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -580,7 +554,7 @@ mod tests {
         let batches = collect(exec, task_ctx).await?;
         assert_eq!(1, batches.len());
         assert_eq!(11, batches[0].num_columns());
-        assert_eq!(8, batches[0].num_rows());
+        assert_eq!(1, batches[0].num_rows());

Review Comment:
   why is this different?



##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -80,104 +71,146 @@ pub struct FileStream<F: FormatReaderOpener> {
     pc_projector: PartitionColumnProjector,
     /// the store from which to source the files.
     object_store: Arc<dyn ObjectStore>,
+    /// The stream state
+    state: FileStreamState,
 }
 
-impl<F: FormatReaderOpener> FileStream<F> {
+enum FileStreamState {
+    Idle,
+    Open {
+        future: ReaderFuture,

Review Comment:
   Perhaps we can add some docstrings -- especially for what `future` represents
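   
   Something like the following, perhaps (the wording is only a suggestion, and the type aliases are stand-ins so the sketch is self-contained; the real ScalarValue, RecordBatch and error types come from datafusion and arrow):
   
   ```rust
   use futures::future::BoxFuture;
   use futures::stream::BoxStream;
   
   // Stand-in aliases for illustration only
   type ScalarValue = ();
   type RecordBatch = ();
   type ArrowResult<T> = Result<T, Box<dyn std::error::Error + Send + Sync>>;
   type ReaderFuture =
       BoxFuture<'static, ArrowResult<BoxStream<'static, ArrowResult<RecordBatch>>>>;
   
   #[allow(dead_code)]
   enum FileStreamState {
       /// Between files: the next poll pulls the next file from the queue
       Idle,
       /// A file is being opened; `future` resolves to its batch stream
       Open {
           /// Resolves to a stream of record batches for the current file
           future: ReaderFuture,
           /// Partitioning column values to project onto each batch
           partition_values: Vec<ScalarValue>,
       },
       /// Batches are being pulled from `reader` for the current file
       Scan {
           partition_values: Vec<ScalarValue>,
           reader: BoxStream<'static, ArrowResult<RecordBatch>>,
       },
       /// A fatal error occurred; subsequent polls return `None`
       Error,
       /// The row limit was reached; remaining files are skipped
       Limit,
   }
   ```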



##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -102,86 +103,72 @@ impl ListingTableUrl {
             false => Url::from_directory_path(path).unwrap(),
         };
 
-        Ok(Self { url, glob })
+        Ok(Self::new(url, glob))
+    }
+
+    /// Creates a new [`ListingTableUrl`] from a url and optional glob expression
+    fn new(url: Url, glob: Option<Pattern>) -> Self {
+        // TODO: fix this upstream

Review Comment:
   Fix what? Is there a ticket?



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -560,8 +534,8 @@ mod tests {
         let _ = collect(exec.clone(), task_ctx.clone()).await?;
         let _ = collect(exec_projected.clone(), task_ctx).await?;
 
-        assert_bytes_scanned(exec, 2522);
-        assert_bytes_scanned(exec_projected, 1924);
+        assert_bytes_scanned(exec, 1851);

Review Comment:
   Can you explain why this number is different?



##########
datafusion/core/src/datasource/file_format/json.rs:
##########
@@ -71,21 +71,33 @@ impl FileFormat for JsonFormat {
     async fn infer_schema(
         &self,
         store: &Arc<dyn ObjectStore>,
-        files: &[FileMeta],
+        objects: &[ObjectMeta],
     ) -> Result<SchemaRef> {
         let mut schemas = Vec::new();
         let mut records_to_read = self.schema_infer_max_rec.unwrap_or(usize::MAX);
-        for file in files {
-            let reader = store.file_reader(file.sized_file.clone())?.sync_reader()?;
-            let mut reader = BufReader::new(reader);
-            let iter = ValueIter::new(&mut reader, None);
-            let schema = infer_json_schema_from_iterator(iter.take_while(|_| {
+        for object in objects {
+            let mut take_while = || {
                 let should_take = records_to_read > 0;
                 if should_take {
                     records_to_read -= 1;
                 }
                 should_take
-            }))?;
+            };
+
+            let schema = match store.get(&object.location).await? {

Review Comment:
   I like how this interface allows for specialized access of LocalFiles as well as streams πŸ‘ 
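   
   A self-contained sketch of that specialization (`read_object` is illustrative; error handling is simplified to a boxed error):
   
   ```rust
   use std::io::Read;
   
   use object_store::{path::Path, GetResult, ObjectStore};
   
   type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;
   
   /// Read an object, using a direct file read when the store is backed by
   /// the local filesystem and buffering the byte stream otherwise.
   async fn read_object(store: &dyn ObjectStore, location: &Path) -> Result<Vec<u8>> {
       match store.get(location).await? {
           // Local store: a real std::fs::File we can read directly
           GetResult::File(mut file, _path) => {
               let mut buf = Vec::new();
               file.read_to_end(&mut buf)?;
               Ok(buf)
           }
           // Remote store: collect the byte stream into memory
           r @ GetResult::Stream(_) => Ok(r.bytes().await?.to_vec()),
       }
   }
   ```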



##########
datafusion/core/Cargo.toml:
##########
@@ -58,9 +58,9 @@ ahash = { version = "0.7", default-features = false }
 arrow = { version = "15.0.0", features = ["prettyprint"] }
 async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
+bytes = "1.1"
 chrono = { version = "0.4", default-features = false }
-datafusion-common = { path = "../common", version = "8.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../data-access", version = "8.0.0" }

Review Comment:
   should we also perhaps remove the `data-access` directory as part of the same PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] matthewmturner commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1152399545

   This is great work - really excited to get this integrated.  I hope to provide some comments / questions this weekend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1169491378

   > I had anticipated that https://github.com/datafusion-contrib/datafusion-objectstore-hdfs would eventually be updated to use the new object_store abstraction, but I hadn't gotten the impression this was a blocker for anyone. Was I mistaken?
   
   @tustvold our team is using DataFusion to read parquet from HDFS; when the time is right to do the integration with HDFS please tell me 😊 and I will follow this up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1169847007

   Not entirely sure; I can't reproduce it on my local machine, only on a remote server where I can't easily run perf because of [debian](https://michcioperz.com/post/slow-perf-script/) shenanigans.
   
   My immediate guess would be that it is something to do with string_required being a very large column chunk, and so we are seeing the downside of separating IO and compute, namely added latency whilst it reads the data into memory, but I'm not too certain. Moving the source parquet file to tmpfs doesn't appear to help. Additionally, string_optional is only half the size and yet does not exhibit the same behaviour. More investigation needed :sweat_smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] matthewmturner commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170504609

   I apologize if I missed it (the GitHub UI is being buggy for me right now), but it might be worth adding examples to the docs of how to use this with different `object_store` features enabled. This could be done as a follow-on, though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1170709314

   
   > I wonder if we can find some workaround so that @Ted-Jiang and his team doesn't lose performance while we continue to make progress (e.g. could we fetch to local file? or put in some hack for people who wanted to decode using a blocking IO thread or something)
   
   @alamb Thanks for your kind attention ❀️ I think this change is reasonable!
   We can keep both async and non-async; we will test it in our environment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wesm commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1171499079

   I think it's fine to switch without the code donation and not wait, but if you think that other DataFusion contributors will want to participate in the maintenance and governance of the object store crate, then doing the code donation sounds like a good idea to me. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1168501535

   I had anticipated that https://github.com/datafusion-contrib/datafusion-objectstore-hdfs would eventually be updated to use the new object_store abstraction, but I hadn't gotten the impression this was a blocker for anyone. Was I mistaken?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1167614100

   I opted to merge #2711 into this PR, as it avoids this being a regression w.r.t. bytes scanned


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1165812303

   @tustvold  do you have an ETA on this PR being ready to merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#issuecomment-1150207250

   > Would we want to hold off merging this until after the 9.0.0 release
   
   That isn't really my call to make, especially since IOx consumes via a git pin and not a release. However, I would say:
   
   * Without #2711, which is dependent on the next arrow-rs release, this may represent a regression for people consuming data from remote object storage, although it's a bit of an apples-and-oranges comparison (fetching whole files vs futures::block_on range requests) and it is unclear which is necessarily better
   * The sooner we make the switch the less painful it will be
   
   My personal preference would be for 9.0.0 to include the switch so we can start to bring the ecosystem along, but I'm not sure if the timings will work out for that. @alamb probably has a view on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r892624015


##########
datafusion/core/src/datasource/listing/helpers.rs:
##########
@@ -558,100 +554,71 @@ mod tests {
             None,
             parse_partitions_for_path(
                 &ListingTableUrl::parse("file:///bucket/mytable").unwrap(),
-                "bucket/mytable/v1/file.csv",
+                &Path::from("bucket/mytable/v1/file.csv"),
                 &[String::from("mypartition")]
             )
         );
         assert_eq!(
             Some(vec!["v1", "v2"]),
             parse_partitions_for_path(
                 &ListingTableUrl::parse("file:///bucket/mytable").unwrap(),
-                "bucket/mytable/mypartition=v1/otherpartition=v2/file.csv",
+                &Path::from("bucket/mytable/mypartition=v1/otherpartition=v2/file.csv"),
                 &[String::from("mypartition"), String::from("otherpartition")]
             )
         );
         assert_eq!(
             Some(vec!["v1"]),
             parse_partitions_for_path(
                 &ListingTableUrl::parse("file:///bucket/mytable").unwrap(),
-                "bucket/mytable/mypartition=v1/otherpartition=v2/file.csv",
-                &[String::from("mypartition")]
-            )
-        );
-    }
-
-    #[cfg(target_os = "windows")]

Review Comment:
   This test is removed as it no longer makes sense now that paths are normalized



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r892625816


##########
datafusion/core/tests/sql/mod.rs:
##########
@@ -825,16 +825,9 @@ impl ExplainNormalizer {
             // Push path as is
             replacements.push((path.to_string_lossy().to_string(), key.to_string()));
 
-            // Push canonical version of path
-            let canonical = path.canonicalize().unwrap();
-            replacements.push((canonical.to_string_lossy().to_string(), key.to_string()));
-
-            if cfg!(target_family = "windows") {
-                // Push URL representation of path, to handle windows
-                let url = Url::from_file_path(canonical).unwrap();
-                let path = url.path().strip_prefix('/').unwrap();
-                replacements.push((path.to_string(), key.to_string()));
-            }
+            // Push URL representation of path

Review Comment:
   Standardized paths :tada:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2677: Switch to object_store crate (#2489)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2677:
URL: https://github.com/apache/arrow-datafusion/pull/2677#discussion_r894890298


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -294,37 +287,50 @@ fn summarize_min_max(
     }
 }
 
-/// Read and parse the schema of the Parquet file at location `path`
-fn fetch_schema(store: &dyn ObjectStore, file: &FileMeta) -> Result<Schema> {
-    let object_reader = store.file_reader(file.sized_file.clone())?;
-    let obj_reader = ChunkObjectReader {
-        object_reader,
-        bytes_scanned: None,
-    };
-    let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
-    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
-    let schema = arrow_reader.get_schema()?;
+async fn fetch_metadata(
+    store: &dyn ObjectStore,
+    file: &ObjectMeta,
+) -> Result<ParquetMetaData> {
+    // TODO: Fetching the entire file to get metadata is wasteful

Review Comment:
   Yes, FWIW #2711 fixes this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org