Posted to issues@calcite.apache.org by "Karshit Shah (Jira)" <ji...@apache.org> on 2021/11/04 18:26:00 UTC

[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

    [ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438878#comment-17438878 ] 

Karshit Shah commented on CALCITE-2040:
---------------------------------------

I've been working on this issue with [~mmior]. I have a Java class that generates an Arrow data file. Currently, I have the following Gradle task to generate the data file.
{code:java}
task("runWithJavaExec", JavaExec::class) {
  main = "org.apache.calcite.adapter.arrow.ArrowData"
  classpath = sourceSets["test"].runtimeClasspath
}
{code}
I need to run "./gradlew arrow:runWithJavaExec", which generates the data file. However, I would like to make this task a dependency of "./gradlew arrow:test" so that the file is generated before the tests run. With my limited knowledge of Gradle, I haven't been able to get that to work. It would be great if anyone could help me out with this.
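
A minimal sketch of the wiring I have in mind, assuming the Gradle Kotlin DSL and that both tasks are declared in the arrow module's build script (I haven't verified this), would be something like:
{code}
// Make the test task depend on the generator task so that the Arrow
// data file is created before any test executes.
tasks.named("test") {
  dependsOn("runWithJavaExec")
}
{code}
With that in place, running "./gradlew arrow:test" should trigger the generator first.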


 

> Create adapter for Apache Arrow
> -------------------------------
>
>                 Key: CALCITE-2040
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2040
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: arrow_data.py
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would allow people to execute SQL statements, via JDBC or ODBC, on data stored in Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, say, CSV files using the file adapter: an Arrow data set does not have a URL. (Unless we use Arrow's [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] format, or use an in-memory file system such as Alluxio.) So we would need to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it would be good to have Arrow as a calling convention: that is, implementations of relational operators such as Filter, Project, and Aggregate, in addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file formats (for instance the bioinformatics formats SAM, VCF and FASTQ discussed in CALCITE-2025), it would make a lot of sense to translate those formats directly into Arrow (applying simple projects and filters first if applicable). Those adapters would fit better as a "contrib" module in the Arrow project than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)