Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/11 18:40:00 UTC
[jira] [Commented] (ARROW-16164) [C++] Pushdown filters on augmented columns like fragment filename
[ https://issues.apache.org/jira/browse/ARROW-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520760#comment-17520760 ]
Weston Pace commented on ARROW-16164:
-------------------------------------
So this is possible, and something like a regex on the filename might be interesting. However, I'm not terribly motivated to work on this because:
* In the above example the user could establish a partitioning on {{cyl}} and then just filter for {{cyl == 8}}.
* For more general filename filtering the user can often do this themselves by creating a dataset, getting the list of files, picking the files they want, and then creating a new dataset from the smaller list of files.
So it might be nice to first know of some key use cases that aren't solvable with other features.
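For reference, the two workarounds above might look roughly like the sketch below. It uses pyarrow (the Python bindings) rather than R. The helper names select_files, open_subset and read_partition are made up for illustration, and the sketch assumes a Parquet dataset written with hive-style directories such as cyl=8/.

```python
import re


def select_files(paths, pattern):
    """Return the paths in which the regex `pattern` matches anywhere."""
    rx = re.compile(pattern)
    return [p for p in paths if rx.search(p)]


def open_subset(root, pattern, file_format="parquet"):
    """Workaround 2: build a new dataset from a hand-picked list of files."""
    # Imported lazily so that select_files stays dependency-free.
    import pyarrow.dataset as ds

    # Discovery lists the files and inspects a schema; it does not scan the data.
    full = ds.dataset(root, format=file_format)
    wanted = select_files(full.files, pattern)
    if not wanted:
        raise ValueError(f"no files under {root!r} match {pattern!r}")
    # Partition columns are not re-derived from the paths in this sketch.
    return ds.dataset(wanted, format=file_format, filesystem=full.filesystem)


def read_partition(root, column, value, file_format="parquet"):
    """Workaround 1: filter on a partition column instead of the filename."""
    import pyarrow.dataset as ds

    # Assumes hive-style directory names such as "cyl=8".
    dataset = ds.dataset(root, format=file_format, partitioning="hive")
    return dataset.to_table(filter=ds.field(column) == value)
```

With a hypothetical directory "mtcars_by_cyl/", open_subset("mtcars_by_cyl/", r"cyl=8") would only ever touch the files whose path contains cyl=8, and read_partition("mtcars_by_cyl/", "cyl", 8) relies on the partition filter instead.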
> [C++] Pushdown filters on augmented columns like fragment filename
> ------------------------------------------------------------------
>
> Key: ARROW-16164
> URL: https://issues.apache.org/jira/browse/ARROW-16164
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
>
> As discussed on ARROW-15260, if we run the following code in R, we might expect the filter to be pushed down so that only the relevant files are read:
> {code:r}
> filter = Expression$create(
>   "match_substring",
>   Expression$field_ref("__filename"),
>   options = list(pattern = "cyl=8")
> )
> {code}
> As mentioned by [~westonpace]:
> "You might think we would get the hint and only read files matching that pattern. This is not the case. We will read the entire dataset and apply the "cyl=8" filter in memory.
> If we want to pushdown filters on the filename column we will need to add some special logic."