Posted to user@spark.apache.org by janardhan shetty <ja...@gmail.com> on 2016/09/20 05:10:40 UTC

SPARK-10835 in 2.0

Hi,

I am hitting this issue: https://issues.apache.org/jira/browse/SPARK-10835

The issue is marked as resolved, but it seems to be resurfacing in 2.0 ML.
Any workaround would be appreciated.

Note:
The pipeline has an NGram stage before Word2Vec.

Error:
val word2Vec = new Word2Vec().setInputCol("wordsGrams").setOutputCol("features").setVectorSize(128).setMinCount(10)

scala> word2Vec.fit(grams)
java.lang.IllegalArgumentException: requirement failed: Column wordsGrams must be of type ArrayType(StringType,true) but was actually ArrayType(StringType,false).
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
  at org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
  at org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)


GitHub code for NGram:

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType.sameType(ArrayType(StringType)),
      s"Input type must be ArrayType(StringType) but got $inputType.")
  }

  override protected def outputDataType: DataType = new ArrayType(StringType, false)
}
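
The mismatch in the error above can be reproduced in isolation. This is only a sketch: it assumes the schema check compares the two types with plain equality, which the error message suggests but which is not confirmed in this thread. The names NullabilityCheck, produced and expected are illustrative and do not come from Spark.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType}

object NullabilityCheck {
  // Type that NGram declares as its output (elements cannot be null).
  val produced = ArrayType(StringType, containsNull = false)

  // Type named in the Word2Vec error message (elements may be null).
  val expected = ArrayType(StringType, containsNull = true)

  // ArrayType is a case class, so == also compares the containsNull flag.
  val strictlyEqual: Boolean = produced == expected

  def main(args: Array[String]): Unit =
    println(s"produced == expected: $strictlyEqual")
}
```

The two types differ only in the containsNull flag, which is why a column that holds nothing but string arrays can still be rejected.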

Re: SPARK-10835 in 2.0

Posted by Sean Owen <so...@cloudera.com>.
You can probably just apply an identity transformation to the column to
make its type a nullable String array -- ArrayType(StringType, true).
I'm not sure why Word2Vec must reject an array type with non-nullable
elements when it can of course handle nullable ones, but the previous
discussion indicated that this also had to do with how UDFs work.
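
A minimal sketch of the identity transformation suggested above, assuming Spark 2.x with spark-sql and spark-mllib on the classpath. It uses a UDF as the identity function, on the assumption that a UDF returning Seq[String] is assigned the type ArrayType(StringType, true). The names NullableGramsSketch, asNullable and withNullableElements are hypothetical, and the sample data is made up.

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object NullableGramsSketch {
  // Identity UDF: returns its input unchanged. The result type is derived
  // from Seq[String], whose elements are treated as nullable.
  val asNullable = udf((xs: Seq[String]) => xs)

  // Replaces the named column with a copy that went through the identity UDF.
  def withNullableElements(df: DataFrame, colName: String): DataFrame =
    df.withColumn(colName, asNullable(col(colName)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullable-grams-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up input: one row holding a tokenized sentence.
    val words = Seq(Seq("spark", "ml", "word2vec", "ngram")).toDF("words")

    val grams = new NGram()
      .setN(2)
      .setInputCol("words")
      .setOutputCol("wordsGrams")
      .transform(words)

    val fixed = withNullableElements(grams, "wordsGrams")

    // Print both column types to compare the containsNull flag.
    println(grams.schema("wordsGrams").dataType)
    println(fixed.schema("wordsGrams").dataType)

    spark.stop()
  }
}
```

If the second printed type shows containsNull as true, the fixed DataFrame can be passed to word2Vec.fit in place of grams. In a Pipeline, this step would need to sit between the NGram and Word2Vec stages, for example wrapped in a custom Transformer.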


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: SPARK-10835 in 2.0

Posted by janardhan shetty <ja...@gmail.com>.
Hi Sean,

Any suggestions for a workaround in the meantime?

Re: SPARK-10835 in 2.0

Posted by janardhan shetty <ja...@gmail.com>.
Thanks Sean.

Re: SPARK-10835 in 2.0

Posted by Sean Owen <so...@cloudera.com>.
Ah, I think that this was supposed to be changed with SPARK-9062. Let
me see about reopening 10835 and addressing it.



Re: SPARK-10835 in 2.0

Posted by janardhan shetty <ja...@gmail.com>.
Is this a bug?