Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/11 18:40:00 UTC

[jira] [Commented] (ARROW-16164) [C++] Pushdown filters on augmented columns like fragment filename

    [ https://issues.apache.org/jira/browse/ARROW-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520760#comment-17520760 ] 

Weston Pace commented on ARROW-16164:
-------------------------------------

So this is possible, and something like a regex filter on the filename might be interesting. However, I'm not terribly motivated to work on this because:

 * In the above example the user could establish a partitioning on {{cyl}} and then just filter for {{cyl == 8}}.
 * For more general filename filtering, the user can often do this themselves: create a dataset, get its list of files, pick the files they want, and then create a new dataset from that smaller list of files.

So it might be nice to first know of some key use cases that aren't solvable with other features.
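
The second workaround above can be sketched roughly as follows in Python with pyarrow. This is only an illustration under assumptions: the files are Parquet, the dataset lives on the local filesystem, and the helper names {{select_files}} and {{open_subset}} are hypothetical, not part of Arrow.

```python
from typing import Iterable, List


def select_files(paths: Iterable[str], pattern: str) -> List[str]:
    """Return the paths that contain `pattern` as a plain substring."""
    return [p for p in paths if pattern in p]


def open_subset(root: str, pattern: str):
    """Open only the files under `root` whose path contains `pattern`.

    Assumes Parquet files on the local filesystem; `root` is a
    hypothetical dataset directory supplied by the caller.
    """
    # Imported lazily so select_files can be used without pyarrow installed.
    import pyarrow.dataset as ds

    # Discover every file in the dataset (this lists files, it does not read data).
    full = ds.dataset(root, format="parquet")
    keep = select_files(full.files, pattern)
    if not keep:
        raise ValueError(f"no files under {root!r} match {pattern!r}")

    # Build a second dataset from the reduced file list, so a scan
    # only touches the selected files.
    return ds.dataset(keep, format="parquet")
```

For example, {{open_subset("path/to/dataset", "cyl=8")}} would return a dataset backed only by the files whose path contains {{cyl=8}}. Because no partitioning is passed when the second dataset is created, partition fields such as {{cyl}} would presumably not appear as columns unless a partitioning is supplied as well.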

> [C++] Pushdown filters on augmented columns like fragment filename
> ------------------------------------------------------------------
>
>                 Key: ARROW-16164
>                 URL: https://issues.apache.org/jira/browse/ARROW-16164
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> Following the discussion on ARROW-15260: if we run the following code in R, we might expect the filter to be pushed down so that only the relevant files are read:
> {code:r}
>   filter = Expression$create(
>     "match_substring",
>     Expression$field_ref("__filename"),
>     options = list(pattern = "cyl=8")
>   )
> {code}
> As mentioned by [~westonpace]:
> "You might think we would get the hint and only read files matching that pattern. This is not the case. We will read the entire dataset and apply the 'cyl=8' filter in memory.
> If we want to push down filters on the filename column we will need to add some special logic."



--
This message was sent by Atlassian Jira
(v8.20.1#820001)