Posted to user@spark.apache.org by jeff saremi <je...@hotmail.com> on 2017/06/20 21:48:06 UTC

Bizarre diff in behavior between scala REPL and sparkSQL UDF

I have a function that does regex matching in Scala. When I test it in the REPL, I get the expected results.

When I use it as a UDF in Spark SQL, I get completely incorrect results.


Function:

import scala.util.matching.Regex

class UrlFilter(filters: Seq[String]) extends Serializable {
  // Compile each filter string once, when the instance is constructed.
  val regexFilters: Seq[Regex] = filters.map(new Regex(_))
  regexFilters.foreach(println)

  // True if any filter matches the whole of `s`; null and empty strings never match.
  def matches(s: String): Boolean = {
    if (s == null || s.isEmpty) return false
    regexFilters.exists { f =>
      print("matching " + f + " against " + s)
      s match {
        case f() => println("; matched! returning true"); true
        case _   => println("; did NOT match. returning false"); false
      }
    }
  }
}

Instantiating it with a pattern like:
^[^:]+://[^.]*\.company[0-9]*9\.com$

(This matches a URL whose host has "company" followed by a number ending in the digit 9, e.g. company9 or company19, under .com.)

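To make the intent of that pattern concrete, here is a small sketch that runs it against a few inputs. Only the first URL comes from this post; the other three are made-up probes, and the object and method names are just for illustration.

```scala
import scala.util.matching.Regex

object PatternDemo {
  // The filter pattern quoted above; triple quotes avoid double-escaping backslashes.
  val pattern: Regex = new Regex("""^[^:]+://[^.]*\.company[0-9]*9\.com$""")

  // True only when the whole string matches the pattern.
  def fullMatch(url: String): Boolean = pattern.pattern.matcher(url).matches()

  def main(args: Array[String]): Unit = {
    // Only the first URL appears in the post; the rest are hypothetical.
    Seq(
      "ftp://ftp.company9.com",
      "http://www.company19.com",
      "http://www.company8.com",
      "http://www.company.com"
    ).foreach(u => println(u + " -> " + fullMatch(u)))
  }
}
```

The first two inputs should match (the number ends in 9) and the last two should not.
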
Test it in Scala REPL:

scala> val filters = Source.fromFile("D:\\cosmos-modules\\testdata\\fakefilters.txt").getLines.toList

scala> val urlFilter = new UrlFilter(filters)

scala>  urlFilter.matches("ftp://ftp.company9.com")
matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com; matched! returning true
res2: Boolean = true


Use it in SparkSQL:

val urlFilter = new UrlFilter(filters)
sqlContext.udf.register("filterListMatch", (url: String) => urlFilter.matches(url))

val nonMatchingUrlsDf = sqlContext.sql("SELECT url FROM distinctUrls WHERE NOT filterListMatch(url)")

Look at the debug prints in the console:
matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com ; did NOT match. returning false

I have repeated this several times to make sure I'm comparing apples to apples.
I am using Spark 1.6 and Scala 2.10.5 with Java 1.8.
Thanks



Re: Bizarre diff in behavior between scala REPL and sparkSQL UDF

Posted by jeff saremi <je...@hotmail.com>.
Never mind!

I had a trailing space at the end of my data, which was not showing up in manual testing.

thanks
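
For anyone who hits the same symptom, here is a minimal sketch of the failure and one possible fix. The pattern is anchored with ^ and $ and the match is against the whole string, so a single trailing space is enough to make it fail; the Spark debug line in the original message does show a space before the semicolon. The object and helper names below are hypothetical, and trimming is only one option, assuming surrounding whitespace is never significant in the data.

```scala
import scala.util.matching.Regex

object TrailingSpaceDemo {
  // Pattern taken from the original post.
  val filter: Regex = new Regex("""^[^:]+://[^.]*\.company[0-9]*9\.com$""")

  // Strict full-string match, the same semantics as the `case f()` pattern in UrlFilter.
  def matchesRaw(s: String): Boolean =
    s != null && filter.pattern.matcher(s).matches()

  // Same check after stripping leading and trailing whitespace.
  def matchesTrimmed(s: String): Boolean =
    s != null && matchesRaw(s.trim)

  // Bracket the value so stray whitespace is visible in debug output.
  def show(s: String): String = "[" + s + "]"

  def main(args: Array[String]): Unit = {
    val clean = "ftp://ftp.company9.com"
    val dirty = "ftp://ftp.company9.com " // note the trailing space
    println(show(clean) + " raw=" + matchesRaw(clean))
    println(show(dirty) + " raw=" + matchesRaw(dirty) + " trimmed=" + matchesTrimmed(dirty))
  }
}
```

Wrapping the value in brackets in the debug print would have exposed the space straight away. On the Spark side, the same idea could probably be applied either by calling s.trim inside the UDF or, assuming the built-in trim function is available in your Spark SQL version, by writing filterListMatch(trim(url)) in the query.
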
