Posted to issues@arrow.apache.org by "Anthony Fox (JIRA)" <ji...@apache.org> on 2017/05/16 14:17:04 UTC
[jira] [Commented] (ARROW-1036) [C++] Define abstract API for filtering Arrow streams (e.g. predicate evaluation)
[ https://issues.apache.org/jira/browse/ARROW-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012462#comment-16012462 ]
Anthony Fox commented on ARROW-1036:
------------------------------------
This is quite similar to a project the GeoMesa team has been working on.
The GeoMesa project has started to put together a SQL-like API over Arrow files in JavaScript for in-browser querying and visualization. We have defined a class called {{ArrowDataSet}} that wraps an Arrow file and exposes query and countBy/groupBy methods. Queries are defined using a set of simple predicate expressions ({{And}}, {{Or}}, {{Equals}}, {{LTEquals}}, {{During}}, etc.), with the idea of eventually adding spatial predicates ({{Contains}}, {{Intersects}}, {{Overlaps}}).
The {{ArrowDataSet}} receives the query and produces a query execution plan. The plan has the usual operators ({{Scan}}, {{Filter}}, {{Project}}, {{HashGroupBy}}) as well as optimized filters for dictionary-encoded values. We are also planning to support a primary sort key, hinted through the Arrow column metadata, with corresponding optimizations and additional operators such as {{PrimarySortKeyScan}}. This will help with seeks when there is a predicate on the primary sort key. For instance, if the primary sort key is {{date}} and the query has a {{During}} predicate with a start and end date, the execution plan will use {{PrimarySortKeyScan}} to efficiently skip batches until it reaches records that pass the predicate.
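To make the shape of this concrete, here is a minimal sketch in Python (not the GeoMesa JavaScript code, and not an Arrow API). It assumes a record batch can be modeled as a plain dict of column lists and that batches are globally sorted on the primary sort key. The predicate classes borrow the names above, but every function name ({{primary_sort_key_scan}}, {{filter_rows}}, {{hash_group_by_count}}, {{execute}}) is hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List, Optional

Batch = Dict[str, List[Any]]  # assumption: one record batch = column name -> values
Row = Dict[str, Any]

@dataclass(frozen=True)
class Equals:
    column: str
    value: Any
    def matches(self, row: Row) -> bool:
        return row[self.column] == self.value

@dataclass(frozen=True)
class LTEquals:
    column: str
    value: Any
    def matches(self, row: Row) -> bool:
        return row[self.column] <= self.value

@dataclass(frozen=True)
class During:
    column: str
    start: Any
    end: Any
    def matches(self, row: Row) -> bool:
        return self.start <= row[self.column] <= self.end

@dataclass(frozen=True)
class And:
    children: tuple
    def matches(self, row: Row) -> bool:
        return all(child.matches(row) for child in self.children)

@dataclass(frozen=True)
class Or:
    children: tuple
    def matches(self, row: Row) -> bool:
        return any(child.matches(row) for child in self.children)

def primary_sort_key_scan(batches: Iterable[Batch], key: str, start: Any, end: Any) -> Iterator[Batch]:
    """Yield only batches whose key range can overlap [start, end].

    Relies on the assumption that batches are globally sorted on `key`."""
    for batch in batches:
        values = batch[key]
        if not values or values[-1] < start:
            continue  # whole batch lies before the range: skip it
        if values[0] > end:
            break  # sorted input: no later batch can match
        yield batch

def filter_rows(batches: Iterable[Batch], predicate) -> Iterator[Row]:
    """Filter operator: materialize rows and keep those passing the predicate."""
    for batch in batches:
        names = list(batch)
        for values in zip(*batch.values()):
            row = dict(zip(names, values))
            if predicate.matches(row):
                yield row

def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Project operator: keep only the requested columns."""
    for row in rows:
        yield {c: row[c] for c in columns}

def hash_group_by_count(rows: Iterable[Row], column: str) -> Counter:
    """HashGroupBy operator specialized to counting (a countBy)."""
    return Counter(row[column] for row in rows)

def sort_key_range(predicate, sort_key: str) -> Optional[During]:
    """Find a During on the sort key that every matching row must satisfy.

    Only And is descended into; under Or, pruning batches would be unsafe."""
    if isinstance(predicate, During) and predicate.column == sort_key:
        return predicate
    if isinstance(predicate, And):
        for child in predicate.children:
            found = sort_key_range(child, sort_key)
            if found is not None:
                return found
    return None

def execute(batches: List[Batch], predicate, sort_key: Optional[str] = None) -> Iterator[Row]:
    """Tiny planner: use the sort-key scan when possible, else a full scan."""
    during = sort_key_range(predicate, sort_key) if sort_key else None
    source: Iterable[Batch] = batches  # plain Scan
    if during is not None:
        source = primary_sort_key_scan(batches, sort_key, during.start, during.end)
    return filter_rows(source, predicate)

batches = [
    {"date": ["2017-01-01", "2017-01-15"], "type": ["a", "b"]},
    {"date": ["2017-02-01", "2017-02-20"], "type": ["a", "a"]},
    {"date": ["2017-03-05", "2017-03-09"], "type": ["b", "a"]},
]
query = And((During("date", "2017-02-10", "2017-03-06"), Equals("type", "a")))
rows = list(execute(batches, query, sort_key="date"))
```

In this sketch the scan only decides which batches are read at all; the {{Filter}} step still evaluates the full predicate on every surviving row, so correctness does not depend on the pruning. A real implementation would presumably work on columnar vectors rather than per-row dicts.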
I'd be interested in how this can be standardized across the languages supported by Arrow.
> [C++] Define abstract API for filtering Arrow streams (e.g. predicate evaluation)
> ---------------------------------------------------------------------------------
>
> Key: ARROW-1036
> URL: https://issues.apache.org/jira/browse/ARROW-1036
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> It would be useful to be able to apply analytic predicates to an Arrow stream in a composable way. As soon as we are able to compute some simple predicates on in-memory Arrow data, we could define our first version of this.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)