Posted to user@spark.apache.org by Tobi Bosede <an...@gmail.com> on 2016/07/14 20:23:38 UTC

Filtering RDD Using Spark.mllib's ChiSqSelector

Hi everyone,

I am trying to filter my features based on the spark.mllib ChiSqSelector.

filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
model.transform(lp.features)))

However, when I do the following I get the error below. Is there any other
way to filter my data to avoid this error?

filteredDataDF=filteredData.toDF()

Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transforamtion.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.


I would directly use the spark.ml ChiSqSelector and work with
dataframes, but I am on spark 1.4 and using pyspark, so spark.ml's
ChiSqSelector is not available to me. filteredData is of type
PipelinedRDD, if that helps. It is not a regular RDD. I think that may
be part of why calling toDF() is not working.


Thanks,

Tobi

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

Posted by Tobi Bosede <an...@gmail.com>.
Thanks Yanbo, will try that!


Re: Filtering RDD Using Spark.mllib's ChiSqSelector

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Tobi,

Thanks for clarifying the question. It's very straightforward to convert
the filtered RDD to a DataFrame; you can refer to the following code snippet:

from pyspark.sql import Row

rdd2 = filteredRDD.map(lambda v: Row(features=v))
df = rdd2.toDF()


Thanks
Yanbo


Re: Filtering RDD Using Spark.mllib's ChiSqSelector

Posted by Tobi Bosede <an...@gmail.com>.
Hi Yanbo,

Appreciate the response. I might not have phrased this correctly, but I
really wanted to know how to convert the pipeline RDD into a data frame. I
have seen the example you posted. However, I need to transform all my data,
not just 1 line. So I did successfully use map to apply the chi-squared
selector and filter the chosen features of my data. I just want to convert
it to a df so I can apply a logistic regression model from spark.ml.

Trust me, I would use the dataframes API if I could, but the chi-squared
functionality is not available to me in the python spark 1.4 API.

Regards,
Tobi


Re: Filtering RDD Using Spark.mllib's ChiSqSelector

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Tobi,

The MLlib RDD-based API does support applying the transformation to both a
single Vector and an RDD, but you did not use it the appropriate way.
Suppose you have an RDD with a LabeledPoint in each line; you can refer to
the following code snippet to train a ChiSqSelectorModel and do the
transformation:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector  # needed for the data below
from pyspark.mllib.feature import ChiSqSelector

data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
        LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
        LabeledPoint(1.0, [0.0, 9.0, 8.0]),
        LabeledPoint(2.0, [8.0, 9.0, 5.0])]

rdd = sc.parallelize(data)
model = ChiSqSelector(1).fit(rdd)
filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
filteredRDD.collect()

However, we strongly recommend you migrate to the DataFrame-based API,
since the RDD-based API has been switched to maintenance mode.

Thanks
Yanbo
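
[Editor's note: an aside not from the thread. Under the hood the RDD-based selector ranks features by Pearson's chi-squared statistic of each (feature value, label) contingency table, treating values as categorical. A minimal pure-Python sketch on the same toy data, with no Spark required; `chi2_stat` is a hypothetical helper name, not a Spark API.]

```python
from collections import Counter

def chi2_stat(feature_vals, labels):
    """Pearson chi-squared statistic of one feature against the label,
    computed from the observed contingency table (values treated as
    categorical, which is how the RDD-based ChiSqSelector scores them)."""
    n = len(labels)
    joint = Counter(zip(feature_vals, labels))        # observed cell counts
    f_marg = Counter(feature_vals)                    # row marginals
    l_marg = Counter(labels)                          # column marginals
    stat = 0.0
    for f, f_count in f_marg.items():
        for l, l_count in l_marg.items():
            expected = f_count * l_count / n          # under independence
            observed = joint.get((f, l), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# Same toy data as the snippet above, densified into rows.
X = [[8.0, 7.0, 0.0],
     [0.0, 9.0, 6.0],
     [0.0, 9.0, 8.0],
     [8.0, 9.0, 5.0]]
y = [0.0, 1.0, 1.0, 2.0]

scores = [chi2_stat(col, y) for col in zip(*X)]       # one score per feature
top = max(range(len(scores)), key=lambda i: scores[i])  # best-scoring index
```

On this data `top` comes out as 2, i.e. the third feature, which matches the single feature that ChiSqSelector(1) keeps in the PySpark docstring example for this dataset.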
