Posted to user@spark.apache.org by Abhishek Anand <ab...@gmail.com> on 2016/05/30 09:06:17 UTC

Running glm in sparkR (data pre-processing step)

Hi,

I want to run the glm function in SparkR on data stored in a CSV file.

I see that the glm function in SparkR takes a Spark DataFrame as input.

When I read the CSV file into a Spark DataFrame, how should I handle the
factor variables/columns in my data?

Do I need to convert it to an R data frame, convert the columns to factors
using as.factor, create a Spark DataFrame from that, and then run glm on it?

Running as.factor over a big dataset is not possible, though.

What pre-processing should be done, and what is the best way to achieve it?


Thanks,
Abhi

Re: Running glm in sparkR (data pre-processing step)

Posted by Yanbo Liang <yb...@gmail.com>.

Yes, you are right.

2016-05-30 2:34 GMT-07:00 Abhishek Anand <ab...@gmail.com>:

>
> Thanks Yanbo.
>
> So you mean that if I have a variable of type double that I want to treat
> as a string (categorical) in my model, I just have to cast that column to
> string and run the glm model? String columns will be one-hot encoded
> directly by the glm provided by SparkR?
>
> Just wanted to clarify, since in R we need to apply as.factor to
> categorical variables.
>
> val dfNew = df.withColumn("C0",df.col("C0").cast("String"))
>
>
> Abhi !!

Re: Running glm in sparkR (data pre-processing step)

Posted by Abhishek Anand <ab...@gmail.com>.

Thanks Yanbo.

So you mean that if I have a variable of type double that I want to treat
as a string (categorical) in my model, I just have to cast that column to
string and run the glm model? String columns will be one-hot encoded
directly by the glm provided by SparkR?

Just wanted to clarify, since in R we need to apply as.factor to
categorical variables.

val dfNew = df.withColumn("C0",df.col("C0").cast("String"))


Abhi !!
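
To make the effect of the cast concrete, here is a minimal plain-Scala sketch.
It uses no Spark API and is only an illustration: the object name
`CastToCategorySketch` and the sample values for column C0 are hypothetical.
It shows that once numeric codes become strings, each distinct value is a
separate label rather than a point on a numeric scale.

```scala
// Illustration only: plain Scala, no Spark. Names and values are hypothetical.
object CastToCategorySketch {
  // Turn numeric codes into string labels, mirroring a cast-to-string on a column.
  def toLabels(codes: Seq[Double]): Seq[String] = codes.map(_.toString)

  // Distinct labels, i.e. the categories a model would see after the cast.
  def categories(codes: Seq[Double]): Seq[String] = toLabels(codes).distinct.sorted

  def main(args: Array[String]): Unit = {
    val c0 = Seq(1.0, 2.0, 1.0, 3.0) // hypothetical values of column C0
    // As doubles, C0 is one numeric predictor; as strings, it has three categories.
    println(categories(c0).mkString(", "))
  }
}
```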

On Mon, May 30, 2016 at 2:58 PM, Yanbo Liang <yb...@gmail.com> wrote:

> Hi Abhi,
>
> In SparkR glm, categorical features (columns of type string) are one-hot
> encoded automatically, so pre-processing like `as.factor` is not
> necessary. You can feed your data directly to model training.
>
> Thanks
> Yanbo

Re: Running glm in sparkR (data pre-processing step)

Posted by Yanbo Liang <yb...@gmail.com>.

Hi Abhi,

In SparkR glm, categorical features (columns of type string) are one-hot
encoded automatically, so pre-processing like `as.factor` is not
necessary. You can feed your data directly to model training.

Thanks
Yanbo
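
For readers unfamiliar with the term, the following is a minimal plain-Scala
sketch of what one-hot encoding a string column means. It is not SparkR's
implementation and uses no Spark API. The object name `OneHotSketch`, the
sample column, and the alphabetical ordering of levels are all assumptions
made for illustration.

```scala
// Illustration only: plain Scala, not SparkR's implementation.
object OneHotSketch {
  // Distinct category labels; sorted order is an assumption of this sketch.
  def levels(column: Seq[String]): Vector[String] = column.distinct.sorted.toVector

  // Map each value to an indicator vector with 1.0 at its level's position.
  def oneHot(column: Seq[String]): Seq[Vector[Double]] = {
    val lv = levels(column)
    column.map(v => lv.map(l => if (l == v) 1.0 else 0.0))
  }

  def main(args: Array[String]): Unit = {
    val color = Seq("red", "green", "red", "blue") // hypothetical string column
    println(levels(color).mkString(", "))
    oneHot(color).foreach(row => println(row.mkString(" ")))
  }
}
```

This sketch keeps an indicator for every level. Regression software often
drops one level as the reference category when the model has an intercept, as
R's model.matrix does with its default contrasts.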
