Posted to user@spark.apache.org by Rahul Nandi <ra...@gmail.com> on 2017/03/31 05:45:22 UTC

How to PushDown ParquetFilter Spark 2.0.1 dataframe

Hi,
I have around 2 million records stored as a Parquet file in S3. The data looks
roughly like this:

id  data
1   abc
2   cdf
3   fas

Now I want to filter out the records whose id matches one of my required ids.

val requiredDataId = Array(1, 2) // might grow to hundreds of ids

df.filter(requiredDataId.contains("id"))

This is my use case.

What would be the best way to do this in Spark 2.0.1 so that the filter is
also pushed down to Parquet?



Thanks and Regards,
Rahul

Re: How to PushDown ParquetFilter Spark 2.0.1 dataframe

Posted by Hanumath Rao Maduri <ha...@gmail.com>.
Hello Rahul,

Please try to use df.filter(df("id").isin(1,2))
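Since requiredDataId can grow to hundreds of ids, you can splice the whole
array into isin with Scala's varargs syntax rather than listing the values by
hand. A minimal sketch, assuming df was read with spark.read.parquet and id is
an integer column:

import org.apache.spark.sql.functions.col

val requiredDataId = Array(1, 2) // may grow to hundreds of ids
val filtered = df.filter(col("id").isin(requiredDataId: _*))

// Filters handed to the Parquet source show up under "PushedFilters"
// in the scan node of the physical plan.
filtered.explain(true)

Note that PushedFilters only tells you the filter was offered to the source;
whether the Parquet reader can actually use an In predicate to skip row groups
depends on the Spark version. If it isn't applied, one workaround is an OR
chain of equality predicates, which the Parquet filter conversion does handle:

val byEquality = requiredDataId.map(i => col("id") === i).reduce(_ || _)
df.filter(byEquality)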

Thanks,
