
Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/04/25 01:31:45 UTC

one hot encode a column of vector

How do I one-hot encode a column of arrays, e.g. ['TG', 'CA']?


FYI, here's my code for one-hot encoding normal categorical columns.
How do I make it work for a column of arrays?


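from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# flight3 is the input DataFrame; build one StringIndexer per categorical column.
# (The original set['ColA', ...] subscripted the set type instead of calling it.)
columns = ['ColA', 'ColB', 'ColC']
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index")
            for column in columns]

# Pipeline.fit() fits each indexer in turn, so they need not be fitted beforehand.
pipeline = Pipeline(stages=indexers)
flight4 = pipeline.fit(flight3).transform(flight3)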

Re: one hot encode a column of vector

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.

How about using CountVectorizer?
http://spark.apache.org/docs/latest/ml-features.html#countvectorizer




