Posted to user@spark.apache.org by tan shai <ta...@gmail.com> on 2016/07/07 09:25:07 UTC

Optimize filter operations with sorted data

Hi,

I have a sorted DataFrame and need to optimize filter operations on it.
How does Spark perform filter operations on a sorted DataFrame?

Does it scan all the data?

Many thanks.

Re: Optimize filter operations with sorted data

Posted by tan shai <ta...@gmail.com>.
Yes, it is operating on the sorted column.

2016-07-07 11:43 GMT+02:00 Ted Yu <yu...@gmail.com>:

> Does the filter under consideration operate on the sorted column(s)?
>
> Cheers

Re: Optimize filter operations with sorted data

Posted by Chanh Le <gi...@gmail.com>.
You can check in the Spark UI or in the output of the Spark application.
Compare how many stages and tasks run before and after you partition the data.
Also compare the run times.

Regards,
Chanh

On Thu, Jul 7, 2016 at 6:40 PM, tan shai <ta...@gmail.com> wrote:

> How can you verify that it is loading only the time and network_id
> partitions named in the filter?

Re: Optimize filter operations with sorted data

Posted by tan shai <ta...@gmail.com>.
How can you verify that it is loading only the time and network_id
partitions named in the filter?

2016-07-07 11:58 GMT+02:00 Chanh Le <gi...@gmail.com>:

> Hi Tan,
> It depends on how the data is organised and what your filter is.
> For example, in my case I store the data partitioned by the fields time and
> network_id. If I filter by time, network_id, or both, together with other
> fields, Spark loads only the time and network_id partitions named in the
> filter and then applies the rest of the filter to that data.

Re: Optimize filter operations with sorted data

Posted by Chanh Le <gi...@gmail.com>.
Hi Tan,
It depends on how the data is organised and what your filter is.
For example, in my case I store the data partitioned by the fields time and network_id. If I filter by time, network_id, or both, together with other fields, Spark loads only the time and network_id partitions named in the filter and then applies the rest of the filter to that data.
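
To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model of partition pruning, not Spark code: the function names (build_partitions, filtered_scan), the clicks field, and the sample rows are all hypothetical and chosen only for illustration. It assumes the data is grouped by (time, network_id), and counts how many rows have to be read with and without a filter on the partition fields.

```python
# Toy model of partition pruning (NOT Spark itself): rows are grouped by
# (time, network_id), mimicking data stored partitioned by those two fields.
from collections import defaultdict


def build_partitions(rows):
    """Group rows by their (time, network_id) partition key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["time"], row["network_id"])].append(row)
    return partitions


def filtered_scan(partitions, time=None, network_id=None, predicate=lambda r: True):
    """Read only partitions whose key matches, then apply the remaining predicate.

    Returns (matching_rows, number_of_rows_read).
    """
    rows_read = 0
    result = []
    for (p_time, p_net), rows in partitions.items():
        # Pruning step: skip whole partitions using only the partition key.
        if time is not None and p_time != time:
            continue
        if network_id is not None and p_net != network_id:
            continue
        # Only the surviving partitions are actually read row by row.
        for row in rows:
            rows_read += 1
            if predicate(row):
                result.append(row)
    return result, rows_read


# Hypothetical sample data: 3 days x 2 networks x 4 rows each = 24 rows.
rows = [
    {"time": day, "network_id": net, "clicks": i}
    for day in ("2016-07-05", "2016-07-06", "2016-07-07")
    for net in (1, 2)
    for i in range(4)
]
partitions = build_partitions(rows)

# Filter on the partition fields plus one other field: one partition is read.
pruned, pruned_read = filtered_scan(
    partitions, time="2016-07-07", network_id=2,
    predicate=lambda r: r["clicks"] >= 2,
)

# Same filter expressed only as a row predicate: every partition is read.
full, full_read = filtered_scan(
    partitions,
    predicate=lambda r: (
        r["time"] == "2016-07-07" and r["network_id"] == 2 and r["clicks"] >= 2
    ),
)

print(pruned_read, full_read)  # prints "4 24"
```

Both calls return the same rows, but the first one touches only the single matching partition. Whether Spark can do the equivalent for a given job depends on how the data was written and on the data source, so treat this only as an illustration of the principle.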



> On Jul 7, 2016, at 4:43 PM, Ted Yu <yu...@gmail.com> wrote:
> 
> Does the filter under consideration operate on the sorted column(s)?
> 
> Cheers


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Optimize filter operations with sorted data

Posted by Ted Yu <yu...@gmail.com>.
Does the filter under consideration operate on the sorted column(s)?

Cheers

> On Jul 7, 2016, at 2:25 AM, tan shai <ta...@gmail.com> wrote:
> 
> Hi, 
> 
> I have a sorted DataFrame and need to optimize filter operations on it.
> How does Spark perform filter operations on a sorted DataFrame?
>
> Does it scan all the data?
> 
> Many thanks. 
