Posted to user@spark.apache.org by Franco Barrientos <fr...@exalitica.com> on 2014/12/24 16:44:24 UTC

null Error in ALS model predict

Hi all,

I have an RDD[(Int, Int, Double, Double)] where the first two Int values are the id
and the product, respectively. I trained an implicit ALS model and want to
make predictions from this RDD. I tried two things which I thought were
equivalent.

 

1-      Convert the RDD to an RDD[(Int, Int)] and call
model.predict(RDD[(Int, Int)]); this works for me.

2-      Map over the RDD and call model.predict(Int, Int) per element, for example:

val ratings = data.map { case (id, rubro, rating, resp) =>
  model.predict(id, rubro)
}

where data is the RDD[(Int, Int, Double, Double)] above, so ratings is an RDD[Double].

 

Now, with the second way, when I call ratings.first() I get the following error:



 

Why does this happen? I need to use the second way.

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrientos@exalitica.com

www.exalitica.com



 


Re: null Error in ALS model predict

Posted by Burak Yavuz <by...@stanford.edu>.
Hi,

The MatrixFactorizationModel consists of two RDDs (the user and product factors). When you use the second method, Spark tries to serialize both RDDs into the closure of your .map() function,
which is not possible, because an RDD cannot be used inside a transformation on another RDD. That is why you receive the NullPointerException. You must use the first method.
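
For reference, a minimal sketch of that first method (the names data and model are illustrative, not from the original post):

import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// data: RDD[(Int, Int, Double, Double)] holding (id, product, rating, resp)
// model: MatrixFactorizationModel returned by ALS.trainImplicit(...)
val userProducts: RDD[(Int, Int)] =
  data.map { case (id, product, _, _) => (id, product) }

// predict over a whole RDD runs as a distributed join against the model's
// factor RDDs, so nothing has to be serialized into a task closure.
val predictions: RDD[Rating] = model.predict(userProducts)
val scores: RDD[Double] = predictions.map(_.rating)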

Best,
Burak


Re: How to identify erroneous input record ?

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
My code did get the output, although not elegantly, and I totally agree about parsing the line 5 times (that's really bad). I will add your suggested logic and check it out. I have a long way to the finish line: I am re-architecting my entire Hadoop code and getting it onto Spark.
Check out what I do at www.medicalsidefx.org. It is primarily an iPhone app, but underlying it are Lucene, Hadoop and, hopefully soon in 2015, Spark :-)

Re: How to identify erroneous input record ?

Posted by Sean Owen <so...@cloudera.com>.
I don't believe that works since your map function does not return a
value for lines shorter than 13 tokens. You should use flatMap and
Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')
  if (tokens.length >= 13) {
    val parsed = tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
    Some(parsed)
  } else {
    None
  }
}



Re: How to identify erroneous input record ?

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
DOH, looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
  if (line.split('$').length >= 13){
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})



  

How to identify erroneous input record ?

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
hey guys

One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))

var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))

demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:

1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record (see the sketch below this message)

How do I implement 1. or 2. in the Spark code?

Thanks

sanjay
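
For completeness, a minimal sketch of option 2 above, wrapping the per-record parse in scala.util.Try so that only malformed records are dropped. demoRddFilter and outFile are the names from the post; everything else is illustrative, not from the thread:

import scala.util.Try

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  // If a record has fewer than 13 fields, tokens(12) throws
  // ArrayIndexOutOfBoundsException; Try turns that into None
  // instead of failing the whole job.
  Try {
    val tokens = line.split('$')
    tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
  }.toOption
}

demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

In practice the flatMap with Some/None shown earlier in the thread (checking tokens.length explicitly) makes the drop intentional and is usually preferable to swallowing exceptions.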