Posted to user@spark.apache.org by Sanjay Subramanian <sa...@yahoo.com.INVALID> on 2014/12/24 17:28:01 UTC

How to identify erroneous input record ?

hey guys 
One of my input records has a problem that makes the code fail.
var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))

var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))

demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)
This is possibly happening because one input record may not have 13 fields.

If this were Hadoop mapper code, I have 2 ways to solve this:

1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record

How do I implement 1. or 2. in the Spark code?

Thanks
sanjay
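
For approach 2, a minimal sketch (illustrative, not code from the original thread) that uses scala.util.Try as the try/catch: any record whose parsing throws, for example an ArrayIndexOutOfBoundsException on a line with fewer than 13 fields, becomes None and is dropped by flatMap instead of failing the job.

import scala.util.Try

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  // Try wraps the parsing; any failure (e.g. too few fields) becomes None,
  // so only the erroneous records are skipped rather than killing the job
  Try {
    val tokens = line.split('$')
    tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
  }.toOption
}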

Re: How to identify erroneous input record ?

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
Although not elegantly, I got the output via my code, but I totally agree on the parsing 5 times (that's really bad). I will add your suggested logic and check it out. I have a "long" way to the finish line. I am re-architecting my entire Hadoop code and getting it onto Spark.
Check out what I do at www.medicalsidefx.org. Primarily an iPhone app, but underlying it is Lucene, Hadoop and hopefully soon in 2015 - Spark :-)
      From: Sean Owen <so...@cloudera.com>
 To: Sanjay Subramanian <sa...@yahoo.com> 
Cc: "user@spark.apache.org" <us...@spark.apache.org> 
 Sent: Wednesday, December 24, 2014 8:56 AM
 Subject: Re: How to identify erroneous input record ?
   
I don't believe that works since your map function does not return a
value for lines shorter than 13 tokens. You should use flatMap and
Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')
  if (tokens.length >= 13) {
    val parsed = tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
    Some(parsed)
  } else {
    None
  }
}



On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian
<sa...@yahoo.com.invalid> wrote:
> DOH Looks like I did not have enough coffee before I asked this :-)
> I added the if statement...
>
> var demoRddFilter = demoRdd.filter(line =>
> !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") ||
> !line.contains("primaryid$caseid$caseversion"))
> var demoRddFilterMap = demoRddFilter.map(line => {
>  if (line.split('$').length >= 13){
>    line.split('$')(0) + "~" + line.split('$')(5) + "~" +
> line.split('$')(11) + "~" + line.split('$')(12)
>  }
> })
>
>
> ________________________________
> From: Sanjay Subramanian <sa...@yahoo.com.INVALID>
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Sent: Wednesday, December 24, 2014 8:28 AM
> Subject: How to identify erroneous input record ?
>
> hey guys
>
> One of my input records has a problem that makes the code fail.
>
> var demoRddFilter = demoRdd.filter(line =>
> !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") ||
> !line.contains("primaryid$caseid$caseversion"))
>
> var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" +
> line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
>
> demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)
>
>
> This is possibly happening because one input record may not have 13
> fields.
>
> If this were Hadoop mapper code, I have 2 ways to solve this
>
> 1. test the number of fields of each line before applying the map function
>
> 2. enclose the mapping function in a try catch block so that the mapping
> function only fails for the erroneous record
>
> How do I implement 1. or 2. in the Spark code?
>
> Thanks
>
>
> sanjay


Re: How to identify erroneous input record ?

Posted by Sean Owen <so...@cloudera.com>.
I don't believe that works since your map function does not return a
value for lines shorter than 13 tokens. You should use flatMap and
Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')
  if (tokens.length >= 13) {
    val parsed = tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
    Some(parsed)
  } else {
    None
  }
}
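
Since the subject line asks how to identify the erroneous record rather than just skip it, here is a sketch along the same lines (the .bad output path and variable names are illustrative, not from the thread) that tokenizes each line once and writes the offending raw lines to a side output for later inspection:

// Tokenize once and keep the raw line alongside its fields
val tokenized = demoRddFilter.map(line => (line, line.split('$')))
tokenized.cache()  // both passes below reuse the tokenized RDD

// Well-formed records: build the output row from the parsed tokens
val good = tokenized.filter(_._2.length >= 13)
  .map { case (_, tokens) => tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12) }
good.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

// Erroneous records: keep the raw lines so they can be inspected later
val bad = tokenized.filter(_._2.length < 13).map(_._1)
bad.saveAsTextFile("/data/aers/msfx/demo/" + outFile + ".bad")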

On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian
<sa...@yahoo.com.invalid> wrote:
> DOH Looks like I did not have enough coffee before I asked this :-)
> I added the if statement...
>
> var demoRddFilter = demoRdd.filter(line =>
> !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") ||
> !line.contains("primaryid$caseid$caseversion"))
> var demoRddFilterMap = demoRddFilter.map(line => {
>   if (line.split('$').length >= 13){
>     line.split('$')(0) + "~" + line.split('$')(5) + "~" +
> line.split('$')(11) + "~" + line.split('$')(12)
>   }
> })
>
>
> ________________________________
> From: Sanjay Subramanian <sa...@yahoo.com.INVALID>
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Sent: Wednesday, December 24, 2014 8:28 AM
> Subject: How to identify erroneous input record ?
>
> hey guys
>
> One of my input records has a problem that makes the code fail.
>
> var demoRddFilter = demoRdd.filter(line =>
> !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") ||
> !line.contains("primaryid$caseid$caseversion"))
>
> var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" +
> line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
>
> demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)
>
>
> This is possibly happening because one input record may not have 13
> fields.
>
> If this were Hadoop mapper code, I have 2 ways to solve this
>
> 1. test the number of fields of each line before applying the map function
>
> 2. enclose the mapping function in a try catch block so that the mapping
> function only fails for the erroneous record
>
> How do I implement 1. or 2. in the Spark code?
>
> Thanks
>
>
> sanjay



Re: How to identify erroneous input record ?

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
DOH Looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
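  // NB: with no else branch this if-expression evaluates to () for lines
  // with fewer than 13 fields, so the RDD element type widens to Any;
  // that is the problem Sean Owen points out in his reply above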
  if (line.split('$').length >= 13){
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})

      From: Sanjay Subramanian <sa...@yahoo.com.INVALID>
 To: "user@spark.apache.org" <us...@spark.apache.org> 
 Sent: Wednesday, December 24, 2014 8:28 AM
 Subject: How to identify erroneous input record ?
   
hey guys 
One of my input records has a problem that makes the code fail.
var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))

var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))

demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)
This is possibly happening because one input record may not have 13 fields.

If this were Hadoop mapper code, I have 2 ways to solve this:

1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record

How do I implement 1. or 2. in the Spark code?

Thanks
sanjay
