Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/04/25 01:31:45 UTC
one hot encode a column of vector
How do I one-hot encode a column of arrays, e.g. ['TG', 'CA']?
FYI, here's my code for one-hot encoding normal categorical columns.
How do I make it work for a column of arrays?
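from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# One StringIndexer per categorical column; flight3 is the input DataFrame.
columns = ['ColA', 'ColB', 'ColC']
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index")
            for column in columns]

# Pipeline.fit() fits every stage, so no per-indexer .fit() call is needed.
pipeline = Pipeline(stages=indexers)
flight4 = pipeline.fit(flight3).transform(flight3)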
Re: one hot encode a column of vector
Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
How about using CountVectorizer?
http://spark.apache.org/docs/latest/ml-features.html#countvectorizer