Posted to user@spark.apache.org by ju...@free.fr on 2018/05/31 08:34:51 UTC

Fastest way to drop useless columns

Hi there !

I have a potentially large dataset (in both number of rows and cols),

and I want to find the fastest way to drop the cols that are useless to
me, i.e. the cols that contain only a single distinct value!

I want to know what you think I could do to make this as fast as
possible using Spark.


I already have a solution using distinct().count() or
approxCountDistinct(), but these may not be the best choice, as they
require going through all the data, even if the first two values tested
in a col already differ (in which case I know that I can keep the col).
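
For reference, a minimal sketch of that check in Scala (df stands in for
my DataFrame):

  import org.apache.spark.sql.functions.approx_count_distinct

  // One pass over the data: approximate distinct counts for every column at once.
  val counts = df.select(df.columns.map(c => approx_count_distinct(c).alias(c)): _*).head()

  // Cols whose (approximate) distinct count is 1 carry no information.
  val uselessCols = df.columns.filter(c => counts.getAs[Long](c) <= 1L)
  val cleaned = df.drop(uselessCols: _*)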


Thx for your ideas !

Julien



Re: Fastest way to drop useless columns

Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi Julien,

One quick and easy-to-implement idea is to use sampling on your dataset,
i.e., sample a large enough subset of your data and test whether some
columns contain only a single unique value there. Repeat the process a
few times and then do the full test only on the surviving columns.

This will allow you to load only a subset of your dataset if it is stored
in Parquet.
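
A rough sketch of the idea in Scala (the helper name and the sampling
fraction are placeholders, not a Spark API):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.countDistinct

  // Check a cheap sample first; only the columns that still look constant
  // need the full scan afterwards.
  def candidateConstantCols(df: DataFrame, fraction: Double): Array[String] = {
    val counts = df.sample(withReplacement = false, fraction = fraction)
      .select(df.columns.map(c => countDistinct(c).alias(c)): _*)
      .head()
    df.columns.filter(c => counts.getAs[Long](c) <= 1L)
  }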

Best,
Anastasios



-- 
-- Anastasios Zouzias
<az...@zurich.ibm.com>

Re: Fastest way to drop useless columns

Posted by devjyoti patra <dj...@gmail.com>.
One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed towards one key in
your samples. The way we define "heavy skewness" is: the mean is more
than one std deviation away from the median.

In your case, such a column can be dropped; a rough sketch follows.
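
Something like this in Scala, for a numeric column (the helper name and
the one-std-deviation threshold are our own convention, not a Spark API):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, mean, stddev}

  def looksHeavilySkewed(df: DataFrame, c: String): Boolean = {
    val stats = df.select(mean(col(c)).alias("m"), stddev(col(c)).alias("s")).head()
    // approxQuantile gives a cheap approximate median.
    val median = df.stat.approxQuantile(c, Array(0.5), 0.01).head
    math.abs(stats.getAs[Double]("m") - median) > stats.getAs[Double]("s")
  }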

On Thu, 31 May 2018, 14:55, <ju...@free.fr> wrote:

> I believe this only works when we need to drop duplicate ROWS.
>
> Here I want to drop the cols which contain only a single value.
>

Re: Fastest way to drop useless columns

Posted by ju...@free.fr.
I believe this only works when we need to drop duplicate ROWS.

Here I want to drop the cols which contain only a single value.


On 2018-05-31 11:16, Divya Gehlot wrote:
> you can try the dropDuplicates function
> 
> https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala



Re: Fastest way to drop useless columns

Posted by Divya Gehlot <di...@gmail.com>.
you can try the dropDuplicates function:

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
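
A minimal usage sketch in Scala (df and the column names are
placeholders); note that dropDuplicates de-duplicates rows:

  val deduped = df.dropDuplicates()                      // across all columns
  val dedupedOnKeys = df.dropDuplicates("colA", "colB")  // across a subset of columns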
