Posted to user@spark.apache.org by "Md. Rezaul Karim" <re...@insight-centre.org> on 2017/10/27 22:28:46 UTC
StringIndexer on several columns in a DataFrame with Scala
Hi All,
There are several categorical columns in my dataset as follows:
[image: Inline images 1]
How can I transform the values in each (categorical) column into numeric values
using StringIndexer, so that the resulting DataFrame can be fed into
VectorAssembler to generate a feature vector?
A naive approach would be to apply a StringIndexer to each categorical
column one by one, but that sounds tedious, I know.
A possible workaround
<https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
in PySpark is to combine several StringIndexers in a list and use a Pipeline to
execute them all, as follows:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
            for column in list(set(df.columns) - set(['date']))]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()
How can I do the same in Scala? I tried the following:
val featureCol = trainingDF.columns
var indexers: Array[StringIndexer] = null

for (colName <- featureCol) {
  val index = new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
    //.fit(trainDF)
  indexers = indexers :+ index
}

val pipeline = new Pipeline()
  .setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()
However, I am getting a NullPointerException at

for (colName <- featureCol)

I am sure I am doing something wrong. Any suggestions?
Regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
Re: StringIndexer on several columns in a DataFrame with Scala
Posted by "Md. Rezaul Karim" <re...@insight-centre.org>.
Hi Nick,
Both approaches worked and I realized my silly mistake too. Thank you so
much.
@Xu, thanks for the update.
Best regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
Re: StringIndexer on several columns in a DataFrame with Scala
Posted by Weichen Xu <we...@databricks.com>.
Yes, I am working on this. Sorry for the delay; I will try to submit a PR ASAP.
Thanks!
Re: StringIndexer on several columns in a DataFrame with Scala
Posted by Nick Pentreath <ni...@gmail.com>.
For now, you must follow this approach of constructing a pipeline
consisting of a StringIndexer for each categorical column. See
https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
allow multiple columns for StringIndexer, which is being worked on
currently.
The reason you're seeing an NPE is:
var indexers: Array[StringIndexer] = null
and then you're trying to append an element to something that is null.
Try this instead:
var indexers: Array[StringIndexer] = Array()
But even better is a more functional approach:

val indexers = featureCol.map { colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
}
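Putting the pieces together, a minimal end-to-end sketch might look like the
following. It assumes a SparkSession named spark and a DataFrame trainingDF
whose listed columns are all categorical strings (the column names here are
made up for illustration):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// Hypothetical categorical columns; substitute your own,
// e.g. trainingDF.columns filtered to the string-typed ones.
val featureCol = Array("workclass", "education", "occupation")

// One StringIndexer stage per categorical column.
val indexers = featureCol.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

// Pipeline.setStages accepts an array of pipeline stages; fitting the
// pipeline fits every indexer, and transform adds the *_indexed columns.
val pipeline = new Pipeline().setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()
```

The *_indexed output columns can then be passed straight to
VectorAssembler.setInputCols to build the feature vector. Note this is a
sketch, not tested against any particular Spark version; depending on your
version you may need indexers.map(_.asInstanceOf[PipelineStage]) to satisfy
setStages.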
On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
rezaul.karim@insight-centre.org> wrote:
> Hi All,
>
> There are several categorical columns in my dataset as follows:
> [image: grafik.png]
>
> How can I transform values in each (categorical) columns into numeric
> using StringIndexer so that the resulting DataFrame can be feed into
> VectorAssembler to generate a feature vector?
>
> A naive approach that I can try using StringIndexer for each categorical
> column. But that sounds hilarious, I know.
> A possible workaround
> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>in
> PySpark is combining several StringIndexer on a list and use a Pipeline
> to execute them all as follows:
>
> from pyspark.ml import Pipelinefrom pyspark.ml.feature import StringIndexer
> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in list(set(df.columns)-set(['date'])) ]
> pipeline = Pipeline(stages=indexers)
> df_r = pipeline.fit(df).transform(df)
> df_r.show()
>
> How I can do the same in Scala? I tried the following:
>
> val featureCol = trainingDF.columns
> var indexers: Array[StringIndexer] = null
>
> for (colName <- featureCol) {
> val index = new StringIndexer()
> .setInputCol(colName)
> .setOutputCol(colName + "_indexed")
> //.fit(trainDF)
> indexers = indexers :+ index
> }
>
> val pipeline = new Pipeline()
> .setStages(indexers)
> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
> newDF.show()
>
> However, I am experiencing NullPointerException at
>
> for (colName <- featureCol)
>
> I am sure, I am doing something wrong. Any suggestion?
>
>
>
> Regards,
> _________________________________
> *Md. Rezaul Karim*, BSc, MSc
> Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> <http://139.59.184.114/index.html>
>