Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/11 09:59:08 UTC

[GitHub] [arrow-datafusion] yjshen commented on pull request #811: Add support for reading remote storage systems

yjshen commented on pull request #811:
URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-896686361


   @houqp @alamb I've finished the original implementation by abstracting the file listing/reading logic into `ObjectStore` and `ObjectReader`, and I think it's ready for review again.
   
   ```rust
   /// Object reader for one file in an object store
   pub trait ObjectReader {
       /// Get a reader for the byte range [start, start + length) of the file
       fn get_reader(&self, start: u64, length: usize) -> Box<dyn Read>;
   
       /// Get the length of the file
       fn length(&self) -> u64;
   }
   
   /// An ObjectStore abstracts access to an underlying file/object storage.
   /// It maps strings (e.g. URLs, filesystem paths, etc) to sources of bytes
   pub trait ObjectStore: Sync + Send + Debug {
       /// Returns the object store as [`Any`](std::any::Any)
       /// so that it can be downcast to a specific implementation.
       fn as_any(&self) -> &dyn Any;
   
       /// Returns all the files with filename extension `ext` in path `prefix`
       fn list_all_files(&self, prefix: &str, ext: &str) -> Result<Vec<String>>;
   
       /// Get object reader for one file
       fn get_reader(&self, file_path: &str) -> Result<Arc<dyn ObjectReader>>;
   }
   ```
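   
   To make the two traits concrete, here is a minimal sketch of what a local-filesystem implementation could look like. It is not the code in this PR: `LocalFileSystem`, `LocalFileReader`, and `collect_files` are hypothetical names, `std::io::Result` stands in for DataFusion's own `Result` type, and the range is treated as `length` bytes starting at `start`.
   
   ```rust
   use std::any::Any;
   use std::fmt::Debug;
   use std::fs::{self, File};
   use std::io::{self, Read, Seek, SeekFrom};
   use std::path::Path;
   use std::sync::Arc;
   
   // Assumption: std::io::Result stands in for DataFusion's Result type.
   type Result<T> = io::Result<T>;
   
   // The two traits, restated from the comment above so the sketch compiles on its own.
   pub trait ObjectReader {
       fn get_reader(&self, start: u64, length: usize) -> Box<dyn Read>;
       fn length(&self) -> u64;
   }
   
   pub trait ObjectStore: Sync + Send + Debug {
       fn as_any(&self) -> &dyn Any;
       fn list_all_files(&self, prefix: &str, ext: &str) -> Result<Vec<String>>;
       fn get_reader(&self, file_path: &str) -> Result<Arc<dyn ObjectReader>>;
   }
   
   /// Hypothetical reader over a single local file.
   struct LocalFileReader {
       path: String,
       len: u64,
   }
   
   impl ObjectReader for LocalFileReader {
       fn get_reader(&self, start: u64, length: usize) -> Box<dyn Read> {
           // The trait method returns no Result, so this sketch panics on I/O errors.
           let mut file = File::open(&self.path).expect("failed to open file");
           file.seek(SeekFrom::Start(start)).expect("failed to seek");
           // `take` caps the reader at `length` bytes from the seek position.
           Box::new(file.take(length as u64))
       }
   
       fn length(&self) -> u64 {
           self.len
       }
   }
   
   /// Hypothetical object store backed by the local filesystem.
   #[derive(Debug)]
   struct LocalFileSystem;
   
   impl ObjectStore for LocalFileSystem {
       fn as_any(&self) -> &dyn Any {
           self
       }
   
       fn list_all_files(&self, prefix: &str, ext: &str) -> Result<Vec<String>> {
           let mut files = Vec::new();
           collect_files(Path::new(prefix), ext, &mut files)?;
           files.sort();
           Ok(files)
       }
   
       fn get_reader(&self, file_path: &str) -> Result<Arc<dyn ObjectReader>> {
           let len = fs::metadata(file_path)?.len();
           Ok(Arc::new(LocalFileReader { path: file_path.to_string(), len }))
       }
   }
   
   /// Recursively collects paths under `dir` whose names end with `ext`.
   fn collect_files(dir: &Path, ext: &str, out: &mut Vec<String>) -> Result<()> {
       for entry in fs::read_dir(dir)? {
           let path = entry?.path();
           if path.is_dir() {
               collect_files(&path, ext, out)?;
           } else if path.to_string_lossy().ends_with(ext) {
               out.push(path.to_string_lossy().into_owned());
           }
       }
       Ok(())
   }
   ```
   
   A remote store (S3, HDFS, etc.) would presumably implement the same two traits, with `get_reader` on the `ObjectReader` issuing a ranged read instead of a seek.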
   
   Several things remain to be done (I don't think they are blockers for this PR; please correct me if I've got something wrong):
   - Async listing (`list_all_files`) as well as async reading (`get_reader`).
   - For Ballista, figure out how to register an `ObjectStore` in the client and pass the registration on to executors.
   - Make JSON / CSV read from `ObjectReader` as well.
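   
   As a rough illustration of the last item, reading line-oriented text through an `ObjectReader` might look like the sketch below. `InMemoryReader` and `read_csv_rows` are hypothetical names used only to keep the example self-contained, and splitting on commas is a simplification: it handles no quoting or escaping, so a real CSV reader would still be needed.
   
   ```rust
   use std::io::{BufRead, BufReader, Cursor, Read};
   
   // Restated from the trait in this comment so the sketch compiles on its own.
   pub trait ObjectReader {
       fn get_reader(&self, start: u64, length: usize) -> Box<dyn Read>;
       fn length(&self) -> u64;
   }
   
   /// Hypothetical in-memory reader, standing in for a real object store file.
   struct InMemoryReader {
       data: Vec<u8>,
   }
   
   impl ObjectReader for InMemoryReader {
       fn get_reader(&self, start: u64, length: usize) -> Box<dyn Read> {
           // Clamp the requested range to the available bytes.
           let start = (start as usize).min(self.data.len());
           let end = start.saturating_add(length).min(self.data.len());
           Box::new(Cursor::new(self.data[start..end].to_vec()))
       }
   
       fn length(&self) -> u64 {
           self.data.len() as u64
       }
   }
   
   /// Reads the whole object and splits each line on commas.
   fn read_csv_rows(reader: &dyn ObjectReader) -> std::io::Result<Vec<Vec<String>>> {
       let bytes = reader.get_reader(0, reader.length() as usize);
       BufReader::new(bytes)
           .lines()
           .map(|line| line.map(|l| l.split(',').map(str::to_string).collect::<Vec<String>>()))
           .collect()
   }
   ```
   
   The point is only that a format reader can take any `Box<dyn Read>` handed out by the trait, regardless of which store produced it.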
   

