Posted to user@spark.apache.org by Meeraj Kunnumpurath <me...@servicesymphony.com> on 2016/11/19 18:10:00 UTC

Logistic Regression Match Error

Hello,

I have the following code that trains a classifier mapping review text to
ratings. I use a tokenizer to split each review into words and a count
vectorizer to turn those words into term-count features. However, when I
train the classifier I get a match error. Any pointers would be very helpful.

The code is below:

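import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()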
import spark.implicits._

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")

val isGood = udf((x: Int) => if (x >= 4) 1 else 0)

val words = tk.transform(df.withColumn("label", isGood('rating)))
val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)

val classifier = new LogisticRegression()

training.show(10)

val simpleModel = classifier.fit(training)
simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)


And the error I get is below.

16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
scala.MatchError:
[null,1.0,(257358,[0,1,2,3,4,5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,208,219,221,235,249,255,260,353,355,371,431,442,641,711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
(of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)

Many thanks
-- 
*Meeraj Kunnumpurath*
*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*
*00 971 50 409 0169*
*meeraj@servicesymphony.com*

Re: Logistic Regression Match Error

Posted by Meeraj Kunnumpurath <me...@servicesymphony.com>.

Thank you, it was the escape character. Adding .option("escape", "\"") to
the CSV reader fixed it.

Regards
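
For readers hitting the same error, here is a minimal sketch of that fix, not
the poster's exact code. As far as I know, Spark's CSV reader uses a backslash
as its default escape character, while the offending record escapes quotes by
doubling them (""soft""). The sample data, the "name" column header, and the
temporary file are assumptions made up for illustration; only the "quote" and
"escape" options come from the thread.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical sample mimicking the offending record: only the review field
// is quoted, and quotes inside it are doubled. The "name" header is assumed.
val sample =
  "name,review,rating\n" +
  "Potty Seat,\"The \"\"soft\"\" seat is not so soft, overall.\",2\n"

// Write the sample to a temporary file so the sketch is self-contained.
val path = Files.createTempFile("reviews", ".csv")
Files.write(path, sample.getBytes("UTF-8"))

val spark = SparkSession.builder().appName("CSV escape sketch").master("local").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")
  .option("escape", "\"") // read "" inside a quoted field as one literal quote
  .csv(path.toString)

df.show(false)
```

With the escape character set to the quote character, the embedded comma and
the doubled quotes stay inside the review field, so the rating column should
still be read as the trailing number.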


Re: Logistic Regression Match Error

Posted by Meeraj Kunnumpurath <me...@servicesymphony.com>.

I tried .option("quote", "\""), which I believe is the default, but I still
get the same error. This is the offending record:

Primo 4-In-1 Soft Seat Toilet Trainer and Step Stool White with Pastel Blue
Seat,"I chose this potty for my son because of the good reviews. I do not
like it. I'm honestly baffled by all the great reviews now that I have this
thing in front of me.1)It is made of cheap material, feels flimsy, the
grips on the bottom of the thing do nothing to keep it in place when the
child sits on it.2)It comes apart into 5 or 6 different pieces and all my
son likes to do is take it apart. I did not want a potty that would turn
into a toy, and this has just become like a puzzle for him, with all the
different pieces.3)It is a little big for him. He is young still but he's a
big boy for his age. I looked at one of the pictures posted and he looks
about the same size as the curly haired kid reading the book, but the potty
in that picture is NOT this potty! This one is a little bigger and he can't
get quite touch his feet on the ground, which is important.4)And one final
thing, maybe most importantly, the ""soft"" seat is not so soft. Doesn't
seem very comfortable to me. It's just plastic on top of plastic... and
after my son sits on it for just a few minutes his butt has horrible red
marks all over it! Definitely not comfortable.So, overall, i'm not
impressed at all.I gave it 2 stars because... it gets the job done I
suppose, and for a child a little bit older than my son it might fit a
little better. Also I really liked the idea that it was 4-in-1.Overall
though, I do not suggest getting this potty. Look elseware!It's probably
best to actually go to a store and look at them first hand, and not order
online. That's what I should have done in the first place.",2


Re: Logistic Regression Match Error

Posted by Meeraj Kunnumpurath <me...@servicesymphony.com>.

Digging further, this looks like an issue with reading the CSV. Some fields
contain embedded commas, and those fields are correctly quoted. However, the
CSV reader seems to get into a pickle when a record mixes quoted and unquoted
fields. Fields are quoted only when they contain commas; otherwise they are
unquoted.

Regards
Meeraj
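
One way to confirm a parsing problem like this is to look for rows whose
rating came back null. The row printed in the MatchError begins with null
where the label would be expected, which suggests at least one record lost
its rating during the read. The sketch below is an illustration under that
assumption: the two-row DataFrame is a hypothetical stand-in for the parsed
file, and the names "broken" and "clean" are invented here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("Null rating check").master("local").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the parsed reviews: the second row imitates a
// record whose rating was lost when the CSV line was split incorrectly.
val reviews = Seq[(String, Option[Int])](
  ("Great stroller, would buy again", Some(5)),
  ("Review text that swallowed its rating", None)
).toDF("review", "rating")

// Rows that would later produce a null label.
val broken = reviews.filter(col("rating").isNull)
broken.show(false)

// Stop-gap only: drop the rows with a null rating before training.
val clean = reviews.na.drop(Seq("rating"))
```

Dropping the null rows hides the symptom; inspecting "broken" first shows
which records the reader is splitting incorrectly.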
