Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/12 12:10:49 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue, #2208: Support SymlinkTextInputFormat FileFormat

tustvold opened a new issue, #2208:
URL: https://github.com/apache/arrow-datafusion/issues/2208

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Hive-compatible metastores, such as AWS Glue (#2206), do not store the individual files within a partition, and instead rely on listing the files in object storage at query time.
   
   This becomes problematic when interacting with data that is either:
   
   * Not partitioned in the way that Hive expects
   * Rewritten in a way that leaves Parquet files behind that no longer form part of the most recent snapshot (e.g. Delta Lake / IOx)
   
   **Describe the solution you'd like**
   
   Much like we currently support a FileFormat of CSV or Parquet, I would like to support a FileFormat of `SymlinkTextInputFormat`. This is just a newline-delimited list of files, stored in object storage alongside a table or partition.
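   
   As a rough illustration of how little parsing such a file would need, here is a minimal sketch. The function name `parse_symlink_manifest` is hypothetical and not an existing DataFusion API, and skipping blank lines and trimming whitespace are assumptions on my part rather than documented behaviour of the format.
   
   ```rust
   /// Parse the contents of a symlink manifest: one target path per line.
   ///
   /// Assumptions (not taken from any specification): surrounding whitespace
   /// is trimmed and blank lines are skipped.
   pub fn parse_symlink_manifest(contents: &str) -> Vec<String> {
       contents
           .lines() // splits on "\n" and also strips a trailing "\r"
           .map(str::trim)
           .filter(|line| !line.is_empty())
           .map(str::to_owned)
           .collect()
   }
   ```
   
   The resulting paths would then be handed to whatever reader handles the underlying file type, in place of the paths obtained from an object store listing.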
   
   The best documentation for this functionality I can find is [here](https://athena.guide/articles/stitching-tables-with-symlinktextinputformat/), and there is documentation [here](https://docs.delta.io/latest/presto-integration.html) on how it is used to enable interoperation between Presto and Delta Lake.
   
   *I'm not entirely sure how the query engine determines the format of the symlink targets, but I guess it must use the file suffix??*
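   
   If the suffix guess were the approach taken, it might look something like the sketch below. This only illustrates the guess above; it is not a statement of how Hive or Presto actually resolve the target format. `TargetFormat` and `guess_format` are hypothetical names and are unrelated to DataFusion's `FileFormat` trait.
   
   ```rust
   /// Hypothetical set of formats a symlink target might be in.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   pub enum TargetFormat {
       Parquet,
       Csv,
   }
   
   /// Guess the format of a symlink target from its file suffix.
   /// Returns `None` when there is no suffix or it is not recognised.
   pub fn guess_format(path: &str) -> Option<TargetFormat> {
       // Only inspect the final path segment, so dots in directory
       // names (e.g. "v1.2/") are not mistaken for a suffix.
       let file_name = path.rsplit('/').next()?;
       let (_, ext) = file_name.rsplit_once('.')?;
       match ext.to_ascii_lowercase().as_str() {
           "parquet" => Some(TargetFormat::Parquet),
           "csv" => Some(TargetFormat::Csv),
           _ => None,
       }
   }
   ```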
   
   **Describe alternatives you've considered**
   
   We could choose not to support this.
   
   **Additional context**
   
   I am not hugely familiar with the precise inner workings of the Hive ecosystem, as I've only interacted with tooling that uses it under the hood. I could therefore be mistaken on some aspects; if so, please feel free to correct me :smile:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org