Posted to user@spark.apache.org by jeff saremi <je...@hotmail.com> on 2017/06/20 21:48:06 UTC
Bizarre diff in behavior between scala REPL and sparkSQL UDF
I have this function, which does regex matching in Scala. When I test it in the REPL, I get the expected results.
When I use it as a UDF in Spark SQL, I get completely incorrect results.
Function:
import scala.util.matching.Regex

class UrlFilter(filters: Seq[String]) extends Serializable {
  // Compile each filter string once and print the patterns for debugging.
  val regexFilters = filters.map(new Regex(_))
  regexFilters.foreach(println)

  // Returns true if any filter matches the whole input string.
  def matches(s: String): Boolean = {
    if (s == null || s.isEmpty) return false
    regexFilters.exists(f => {
      print("matching " + f + " against " + s)
      s match {
        case f() => { println("; matched! returning true"); true }
        case _ => { println("; did NOT match. returning false"); false }
      }
    })
  }
}
Instantiating it with a pattern like:
^[^:]+://[^.]*\.company[0-9]*9\.com$
(matches a url that has company in the name and a number that ends in digit 9)
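As a rough illustration of what this pattern accepts, here is a minimal sketch using scala.util.matching.Regex. The object name PatternDemo, the helper accepts, and the two http:// URLs are made up for the example; only ftp://ftp.company9.com comes from the post. It relies on the fact that a Regex used as an extractor (case pattern() => ...) succeeds only when the entire input matches.

```scala
import scala.util.matching.Regex

object PatternDemo {
  // The pattern from the post, in a triple-quoted string so backslashes need no escaping.
  val pattern: Regex = new Regex("""^[^:]+://[^.]*\.company[0-9]*9\.com$""")

  // Same matching style as UrlFilter.matches: the extractor must match the whole string.
  def accepts(url: String): Boolean = url match {
    case pattern() => true
    case _         => false
  }

  def main(args: Array[String]): Unit = {
    println(accepts("ftp://ftp.company9.com"))    // true: number is just "9"
    println(accepts("http://www.company129.com")) // true: number ends in 9
    println(accepts("http://www.company12.com"))  // false: number does not end in 9
  }
}
```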
Test it in the Scala REPL:
scala> import scala.io.Source
scala> val filters = Source.fromFile("D:\\cosmos-modules\\testdata\\fakefilters.txt").getLines.toList
scala> val urlFilter = new UrlFilter(filters)
scala> urlFilter.matches("ftp://ftp.company9.com")
matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com; matched! returning true
res2: Boolean = true
Use it in SparkSQL:
val urlFilter = new UrlFilter(filters)
sqlContext.udf.register("filterListMatch", (url: String) => urlFilter.matches(url))
val nonMatchingUrlsDf = sqlContext.sql("SELECT url FROM distinctUrls WHERE NOT filterListMatch(url)")
Look at the debug prints in the console:
matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com ; did NOT match. returning false
I have repeated this several times to make sure I'm comparing apples to apples.
I am using Spark 1.6 and Scala 2.10.5 with Java 1.8.
Thanks
Re: Bizarre diff in behavior between scala REPL and sparkSQL UDF
Posted by jeff saremi <je...@hotmail.com>.
Never mind!
I had a space at the end of my data, which was not showing up in manual testing.
Thanks
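For anyone hitting the same symptom, the effect of a trailing space can be reproduced outside Spark with a small sketch like the one below. The object name TrailingSpaceDemo and the helpers matches and visible are hypothetical, introduced only for this example. Because the pattern is anchored and the extractor requires a whole-string match, a single trailing space is enough to make the match fail. Bracketing the value in the debug output makes the stray whitespace easy to spot.

```scala
import scala.util.matching.Regex

object TrailingSpaceDemo {
  val pattern: Regex = new Regex("""^[^:]+://[^.]*\.company[0-9]*9\.com$""")

  // Whole-string match, as in UrlFilter.matches.
  def matches(s: String): Boolean = s match {
    case pattern() => true
    case _         => false
  }

  // Hypothetical logging helper: brackets make leading/trailing whitespace visible.
  def visible(s: String): String = "[" + s + "]"

  def main(args: Array[String]): Unit = {
    val dirty = "ftp://ftp.company9.com " // note the trailing space
    println(visible(dirty) + " -> " + matches(dirty))           // [ftp://ftp.company9.com ] -> false
    println(visible(dirty.trim) + " -> " + matches(dirty.trim)) // [ftp://ftp.company9.com] -> true
  }
}
```

One possible remedy, assuming the whitespace is not significant in the data, is to trim inside the registered function, for example (url: String) => urlFilter.matches(if (url == null) null else url.trim).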