Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/19 23:20:03 UTC

[GitHub] [arrow-ballista] yahoNanJing opened a new issue, #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

yahoNanJing opened a new issue, #6:
URL: https://github.com/apache/arrow-ballista/issues/6

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   <!-- A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for this feature, in addition to  the *what*) -->
   
   After introducing the object store API, there are still some gaps in supporting access to remote object stores from Ballista executors. For example, as apache/arrow-datafusion#349 and apache/arrow-datafusion#1417 mention, Ballista is not able to support remote object stores.
   
   **Describe the solution you'd like**
   <!-- A clear and concise description of what you want to happen. -->
   
   Our workaround is to make the file path self-describing. For example, a local file path should be `file://tmp/...`, and an HDFS file path should be `hdfs://localhost:xxx/tmp/...`.
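   
   To illustrate the idea, here is a minimal sketch of scheme-based dispatch (the `pick_store` helper and the match arms are hypothetical, purely for illustration; actual resolution would go through DataFusion's object store registry):
   ```rust
    use url::Url;
    
    // Hypothetical helper: pick an object store implementation based on
    // the scheme embedded in a self-describing path.
    fn pick_store(path: &str) -> Result<&'static str, String> {
        let url = Url::parse(path).map_err(|e| e.to_string())?;
        match url.scheme() {
            "file" => Ok("local filesystem store"),
            "hdfs" => Ok("HDFS store"),
            "s3" => Ok("S3 store"),
            other => Err(format!("no object store registered for scheme {other}")),
        }
    }
    
    fn main() {
        // A self-describing path carries enough information to route the read.
        println!("{:?}", pick_store("hdfs://localhost:9000/tmp/data.parquet"));
    }
   ```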
   




[GitHub] [arrow-ballista] milenkovicm commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by GitBox <gi...@apache.org>.
milenkovicm commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1234449100

   @avantgardnerio SQL would be a perfect fit.




Re: [I] [Ballista] Support to access remote object store, like HDFS, S3, etc [arrow-ballista]

Posted by "YuriyGavrilov (via GitHub)" <gi...@apache.org>.
YuriyGavrilov commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1751375500

   It would also be nice to have support for the Storj network via uplink: https://github.com/storj/uplink (Rust bindings for libuplink: https://github.com/storj-thirdparty/uplink-rust).
   




[GitHub] [arrow-ballista] saikrishna1-bidgely commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.
saikrishna1-bidgely commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1450852673

   @ahmedriza I tried what you suggested. I built the scheduler and executor with the `s3` feature added to the `ballista-core` dependency in `Cargo.toml`, but I'm getting the same error: `Error: DataFusionError(Execution("No object store available for s3://ballista-test-bucket/temp.csv"))`. I'm running this on Windows 10. Also, I'm running the code in another repo with this `Cargo.toml`:
   ```toml
   [package]
   name = "ballista-test"
   version = "0.1.0"
   edition = "2021"
   
   # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
   
   [dependencies]
   ballista = "0.11.0"
   datafusion = "18.0.0"
   tokio = "1.0"
   parquet = "29.0.0"
   ```




[GitHub] [arrow-ballista] ahmedriza commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by "ahmedriza (via GitHub)" <gi...@apache.org>.
ahmedriza commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1426482526

   @saikrishna1-bidgely, here's an example of what I tried, and it worked.  I'm not 100% sure this is how it is supposed to be done :-)
   
   - Build the `scheduler` with the `s3` feature enabled, i.e. 
   ```toml
   ballista-core = { path = "../core", version = "0.10.0" , features = ["s3"] }
   ```
   - Define the following environment variables (I tested with a MinIO instance):
   ```
   AWS_DEFAULT_REGION
   AWS_ACCESS_KEY_ID
   AWS_SECRET_ACCESS_KEY
   AWS_ENDPOINT
   ```
   You may need to define additional `AWS` environment variables depending on your S3 service (see the sketch after this list).
   - Start `scheduler` and `executor`
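   
   For reference, this is roughly what env-based construction looks like with the `object_store` crate's builder (a sketch, not Ballista's actual wiring; `AmazonS3Builder::from_env` exists in recent `object_store` releases, and the bucket name here is made up):
   ```rust
    use object_store::aws::AmazonS3Builder;
    
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // from_env() picks up AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID,
        // AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT and friends.
        let s3 = AmazonS3Builder::from_env()
            .with_bucket_name("ballista-test-bucket") // hypothetical bucket
            .build()?;
        println!("constructed store: {s3}");
        Ok(())
    }
   ```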
   
   I tested with the following sample code:
   ```rust
    use ballista::prelude::BallistaContext;
    use ballista_core::config::BallistaConfig;
    use datafusion::prelude::ParquetReadOptions;
    
    #[tokio::main]
    pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let config = BallistaConfig::builder().build()?;
        // Connect to a running Ballista scheduler.
        let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
        let filename = "s3://foo/test.parquet";
        let df = ctx
            .read_parquet(filename, ParquetReadOptions::default())
            .await?;
        let rows = df.count().await?;
        println!("rows: {}", rows);
        Ok(())
    }
   ```
   The code correctly returns the number of rows in the Parquet file:
   ```
   rows: 15309
   ```
   
   Last bit of logging from the `scheduler` process:
   ```
   2023-02-10T23:16:11.345868Z  INFO tokio-runtime-worker ThreadId(05) ballista_scheduler::display: === [pRdAqhp/2] Stage finished, physical plan with metrics ===
   ShuffleWriterExec: None, metrics=[output_rows=1, input_rows=1, repart_time=1ns, write_time=721.748µs]
     AggregateExec: mode=Final, gby=[], aggr=[COUNT(NULL)], metrics=[output_rows=1, elapsed_compute=28.791µs, spill_count=0, spilled_bytes=0, mem_used=0]
       CoalescePartitionsExec, metrics=[]
         ShuffleReaderExec: partitions=1, metrics=[]
   
   
   2023-02-10T23:16:11.346163Z  INFO tokio-runtime-worker ThreadId(05) ballista_scheduler::state::execution_graph: Job pRdAqhp is success, finalizing output partitions
   2023-02-10T23:16:11.346354Z  INFO tokio-runtime-worker ThreadId(05) ballista_scheduler::scheduler_server::query_stage_scheduler: Job pRdAqhp success
   ```
   
   From the `executor` process:
   ```
   2023-02-10T23:16:11.241689Z  INFO          task_runner ThreadId(22) ballista_executor::metrics: === [pRdAqhp/2/0] Physical plan with metrics ===
   ShuffleWriterExec: None, metrics=[output_rows=1, input_rows=1, write_time=721.748µs, repart_time=1ns]
     AggregateExec: mode=Final, gby=[], aggr=[COUNT(NULL)], metrics=[output_rows=1, elapsed_compute=28.791µs, spill_count=0, spilled_bytes=0, mem_used=0]
       CoalescePartitionsExec, metrics=[]
         ShuffleReaderExec: partitions=1, metrics=[]
   ```
   
   
   
   Hope this helps.




[GitHub] [arrow-ballista] saikrishna1-bidgely commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by GitBox <gi...@apache.org>.
saikrishna1-bidgely commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1371370220

   Is loading the data from S3 possible with a distributed setup right now? If so, can you provide a small example?




[GitHub] [arrow-ballista] avantgardnerio commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1234556592

   > `ObjectStore` in the `ExecutionContext` in order to use it right?
   
   I think the problem is that this must happen dynamically in the case of a DataFusion executor in Ballista. The solution I am proposing is a `TableProviderFactory` in https://github.com/apache/arrow-datafusion/pull/3311
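   
   For readers following along, here is a rough sketch of the dispatch idea behind a table provider factory (the trait shape below is illustrative only; DataFusion's actual `TableProviderFactory` takes session state and the `CREATE EXTERNAL TABLE` command, and its signature has evolved across releases):
   ```rust
    use std::collections::HashMap;
    use std::sync::Arc;
    
    // Illustrative stand-in: the real factory returns Arc<dyn TableProvider>;
    // a String result keeps this sketch self-contained.
    trait TableProviderFactory {
        fn create(&self, location: &str) -> String;
    }
    
    struct DeltaFactory;
    impl TableProviderFactory for DeltaFactory {
        fn create(&self, location: &str) -> String {
            format!("delta table at {location}")
        }
    }
    
    fn main() {
        // Executors keep a registry keyed by table format, so providers can
        // be constructed dynamically when a plan that references them arrives.
        let mut factories: HashMap<String, Arc<dyn TableProviderFactory>> = HashMap::new();
        factories.insert("DELTATABLE".to_string(), Arc::new(DeltaFactory));
        println!("{}", factories["DELTATABLE"].create("s3://bucket/path"));
    }
   ```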




[GitHub] [arrow-ballista] milenkovicm commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by GitBox <gi...@apache.org>.
milenkovicm commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1234425565

   Had a quick look into this issue, and from what I can see, there is nothing missing on the DataFusion side to have this functionality (apart from some hard work :)).
   
   The team did a great job adding object store support in DataFusion:
   
   ```rust
    use std::sync::Arc;
    use datafusion::{
        datasource::listing::{ListingTable, ListingTableConfig, ListingTableUrl},
        prelude::SessionContext,
    };
    use object_store::aws::AmazonS3Builder;
    
    #[tokio::main]
    async fn main() {
        let ctx = SessionContext::new();
    
        // Build an S3 object store pointing at a local MinIO instance.
        let s3 = AmazonS3Builder::new()
            .with_region("us-east-1")
            .with_bucket_name("testbucket")
            .with_access_key_id("MINIO")
            .with_secret_access_key("MINIO/MINIO")
            .with_endpoint("http://localhost:9000")
            .with_allow_http(true)
            .build()
            .unwrap();
    
        let s3 = Arc::new(s3);
    
        // Register the store for s3:// URLs with host localhost:9000.
        ctx.runtime_env()
            .register_object_store("s3", "localhost:9000", s3);
    
        let url = ListingTableUrl::parse("s3://localhost:9000/testpath/").unwrap();
    
        // Infer schema and file format from the files at the URL.
        let config = ListingTableConfig::new(url)
            .infer(&ctx.state())
            .await
            .unwrap();
    
        let table = ListingTable::try_new(config).unwrap();
        ctx.register_table("test", Arc::new(table)).unwrap();
    
        ctx.sql("SELECT * FROM test")
            .await
            .unwrap()
            .show()
            .await
            .unwrap();
    }
   ```
   
   I gave it a quick try with Ballista `standalone`, changing the code a bit to expose `RuntimeEnv` on the client, scheduler, and executor, and registering the store on each of them manually. In the end, it did produce the correct result. Currently, getting to a `RuntimeEnv` is not a "walk in the park"; a few hacks here and there were needed, but it would not be hard to make it easier. It would then be possible to load object store providers from configuration files.
   
   Alternatively, `register_object_store` could be provided directly on the `BallistaContext`, with the object store configuration then somehow magically serialized and handled by the other actors in the system. `AmazonS3Builder` would probably need to be modified so that it can be serialized.
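   
   A sketch of what such a serializable configuration could look like (struct and field names are hypothetical; assumes `serde` with the `derive` feature and the `object_store` crate):
   ```rust
    use object_store::aws::{AmazonS3, AmazonS3Builder};
    use serde::{Deserialize, Serialize};
    
    // Hypothetical: a plain-data config that can travel from the client to
    // the scheduler/executors, then be rebuilt into a store on each node.
    #[derive(Serialize, Deserialize)]
    struct S3StoreConfig {
        region: String,
        bucket: String,
        endpoint: String,
        access_key_id: String,
        secret_access_key: String,
        allow_http: bool,
    }
    
    impl S3StoreConfig {
        fn build(&self) -> object_store::Result<AmazonS3> {
            AmazonS3Builder::new()
                .with_region(&self.region)
                .with_bucket_name(&self.bucket)
                .with_endpoint(&self.endpoint)
                .with_access_key_id(&self.access_key_id)
                .with_secret_access_key(&self.secret_access_key)
                .with_allow_http(self.allow_http)
                .build()
        }
    }
   ```
   Each node would deserialize the config and call `build()` to get a local store instance before registering it with its `RuntimeEnv`.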
   
   




[GitHub] [arrow-ballista] avantgardnerio commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1234443642

   @yahoNanJing this issue seems related to https://github.com/apache/arrow-datafusion/pull/3311 where we are working towards allowing users to register `delta-rs` tables dynamically through SQL at runtime.
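   
   For context, the end state that PR works towards looks roughly like this (the table name and location are made up; `CREATE EXTERNAL TABLE ... STORED AS ... LOCATION ...` is DataFusion's existing SQL syntax, and a matching object store must already be registered for the query to run):
   ```rust
    use datafusion::error::Result;
    use datafusion::prelude::SessionContext;
    
    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();
        // With the right factory/object store registered, users can attach
        // remote tables at runtime through plain SQL, no Rust code needed.
        ctx.sql(
            "CREATE EXTERNAL TABLE events \
             STORED AS PARQUET \
             LOCATION 's3://my-bucket/events/'",
        )
        .await?;
        ctx.sql("SELECT COUNT(*) FROM events").await?.show().await?;
        Ok(())
    }
   ```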




[GitHub] [arrow-ballista] saikrishna1-bidgely commented on issue #6: [Ballista] Support to access remote object store, like HDFS, S3, etc

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.
saikrishna1-bidgely commented on issue #6:
URL: https://github.com/apache/arrow-ballista/issues/6#issuecomment-1450795557

   @ahmedriza can you make this into a PR, please? That would be very helpful.

