Posted to issues@arrow.apache.org by "Anthony Fox (JIRA)" <ji...@apache.org> on 2017/05/16 14:17:04 UTC

[jira] [Commented] (ARROW-1036) [C++] Define abstract API for filtering Arrow streams (e.g. predicate evaluation)

    [ https://issues.apache.org/jira/browse/ARROW-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012462#comment-16012462 ] 

Anthony Fox commented on ARROW-1036:
------------------------------------

This is quite similar to a project the GeoMesa team has been working on.

The GeoMesa project has started to put together a SQL-like API over Arrow files in JavaScript for in-browser querying and visualization. We have defined a class called {{ArrowDataSet}} that wraps an Arrow file and exposes query and countBy/groupBy methods. Queries are defined using a set of simple predicate expressions ({{And}}, {{Or}}, {{Equals}}, {{LTEquals}}, {{During}}, etc.), with the idea of eventually adding spatial predicates ({{Contains}}, {{Intersects}}, {{Overlaps}}). The {{ArrowDataSet}} receives the query and produces a query execution plan. The plan has the usual operators ({{Scan}}, {{Filter}}, {{Project}}, {{HashGroupBy}}) as well as optimized filters for dictionary-encoded values.

We are also planning to support a primary sort key that is hinted through the Arrow column metadata, with corresponding optimizations through additional operators like {{PrimarySortKeyScan}}. This will help with seeks when there is a predicate on the primary sort key. For instance, if the primary sort key is {{date}} and the query has a {{During}} predicate with a start and end date, the execution plan will use {{PrimarySortKeyScan}} to efficiently skip batches until it reaches records that pass the predicate.
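
To make the idea concrete, here is a rough TypeScript sketch of a predicate tree and a sort-key-aware scan. It is not the GeoMesa code or any Arrow API: the {{Row}} and {{Batch}} shapes, the per-batch {{minKey}}/{{maxKey}} statistics, the inclusive bounds on {{During}}, and the functions {{evaluate}}, {{scan}}, {{primarySortKeyScan}} and {{execute}} are all assumptions made for illustration.

```typescript
// Hypothetical sketch only: these types and functions are illustrative,
// not the GeoMesa or Arrow JavaScript API.
type Row = Record<string, number | string>;

// One record batch plus assumed min/max statistics for the primary sort key.
interface Batch {
  minKey: number;
  maxKey: number;
  rows: Row[];
}

// Predicate expression tree, using the predicate names from the comment.
type Predicate =
  | { kind: "Equals"; column: string; value: number | string }
  | { kind: "LTEquals"; column: string; value: number }
  | { kind: "During"; column: string; start: number; end: number }
  | { kind: "And"; children: Predicate[] }
  | { kind: "Or"; children: Predicate[] };

// Evaluate a predicate against a single row.
function evaluate(p: Predicate, row: Row): boolean {
  switch (p.kind) {
    case "Equals":
      return row[p.column] === p.value;
    case "LTEquals":
      return (row[p.column] as number) <= p.value;
    case "During": {
      // Assumption: both ends of the interval are inclusive.
      const v = row[p.column] as number;
      return p.start <= v && v <= p.end;
    }
    case "And":
      return p.children.every((c) => evaluate(c, row));
    case "Or":
      return p.children.some((c) => evaluate(c, row));
  }
}

// Scan: read every row of every batch.
function scan(batches: Batch[]): Row[] {
  const out: Row[] = [];
  for (const b of batches) for (const r of b.rows) out.push(r);
  return out;
}

// PrimarySortKeyScan: batches are assumed sorted by key, so batches that
// end before the window are skipped and the scan stops once a batch
// starts after the window.
function primarySortKeyScan(batches: Batch[], start: number, end: number): Row[] {
  const out: Row[] = [];
  for (const b of batches) {
    if (b.maxKey < start) continue; // entirely before the window
    if (b.minKey > end) break; // sorted input: no later batch can match
    for (const r of b.rows) out.push(r);
  }
  return out;
}

// Tiny "planner": pick the scan operator, then apply Filter.
function execute(batches: Batch[], p: Predicate, sortKey: string): Row[] {
  const source =
    p.kind === "During" && p.column === sortKey
      ? primarySortKeyScan(batches, p.start, p.end)
      : scan(batches);
  return source.filter((r) => evaluate(p, r));
}

// Usage: three batches sorted by a numeric "date" column.
const batches: Batch[] = [
  { minKey: 1, maxKey: 3, rows: [{ date: 1 }, { date: 2 }, { date: 3 }] },
  { minKey: 4, maxKey: 6, rows: [{ date: 4 }, { date: 5 }, { date: 6 }] },
  { minKey: 7, maxKey: 9, rows: [{ date: 7 }, { date: 8 }, { date: 9 }] },
];
const query: Predicate = { kind: "During", column: "date", start: 5, end: 7 };
const result = execute(batches, query, "date");
```

In this sketch the first batch is never read, because its {{maxKey}} is below the start of the window. The residual {{Filter}} is still applied, since the batches that do overlap the window also contain rows outside it.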

I'd be interested in how this can be standardized across the languages supported by Arrow.

> [C++] Define abstract API for filtering Arrow streams (e.g. predicate evaluation)
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-1036
>                 URL: https://issues.apache.org/jira/browse/ARROW-1036
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>
> It would be useful to be able to apply analytic predicates to an Arrow stream in a composable way. As soon as we are able to compute some simple predicates on in-memory Arrow data, we could define our first version of this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)