Posted to issues@spark.apache.org by "Nick Lothian (JIRA)" <ji...@apache.org> on 2017/05/22 12:01:04 UTC
[jira] [Created] (SPARK-20838) Spark ML ngram feature extractor should support ngram range like scikit
Nick Lothian created SPARK-20838:
------------------------------------
Summary: Spark ML ngram feature extractor should support ngram range like scikit
Key: SPARK-20838
URL: https://issues.apache.org/jira/browse/SPARK-20838
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.1.1
Reporter: Nick Lothian
Currently the Spark ML NGram feature extractor requires a single n-gram size n (which defaults to 2).
This means that to tokenize to words, bigrams and trigrams (which is pretty common) you need a pipeline like this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import NGram, StopWordsRemover, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words")
bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
trigram = NGram(n=3, inputCol=remover.getOutputCol(), outputCol="trigrams")
pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])
That's not terrible, but the big problem is that the words, bigrams, and trigrams end up in separate output columns, and the only way (in pyspark) to combine them is to explode each of the words, bigrams, and trigrams columns and then union the results back together.
In my experience this makes feature extraction slower than doing the whole thing in a Python UDF. This seems preposterous!
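For comparison, here is a plain-Python sketch (not Spark code) of the behaviour scikit-learn exposes via CountVectorizer's ngram_range parameter: a single (min_n, max_n) pair that emits unigrams through trigrams into one combined output list. The function name and defaults below are illustrative only, not a proposed Spark API:

```python
def ngrams_in_range(tokens, min_n=1, max_n=3):
    # Emit every n-gram for each n in [min_n, max_n], in one flat list --
    # roughly what a single NGram stage with a range parameter could produce,
    # instead of one output column per n.
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

words = ["spark", "ml", "ngram"]
print(ngrams_in_range(words, 1, 3))
# -> ['spark', 'ml', 'ngram', 'spark ml', 'ml ngram', 'spark ml ngram']
```

With an equivalent range parameter on the Spark ML NGram transformer, the pipeline above would need only one NGram stage and no post-hoc column merging.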
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org