Posted to user@spark.apache.org by Haig Didizian <ha...@didizian.com> on 2016/11/08 13:36:07 UTC

how to write a substring search efficiently?

Hello!

I have two datasets -- one of short strings, one of longer strings. Some of
the longer strings contain the short strings, and I need to identify which.

What I've written is taking forever to run (pushing 11 hours on my
quad-core i5 with 12 GB RAM), and it appears to be CPU-bound. The way
I've written it also required me to enable cross joins
(spark.sql.crossJoin.enabled = true), which makes me wonder whether
there's a better approach.
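
In case it matters, I'm enabling that roughly like this (it could just
as well be passed to spark-submit via --conf):

// one way to set it at runtime, assuming `spark` is the active SparkSession;
// equivalent to --conf spark.sql.crossJoin.enabled=true
spark.conf.set("spark.sql.crossJoin.enabled", "true")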

What I've done looks something like this:


shortStrings
  .as("short")
  // keep every (short, long) pair where the long sequence contains
  // the short sequence as a substring
  .joinWith(longStrings.as("long"), $"long.sequence".contains($"short.sequence"))
  .map { joined =>
    StringMatch(joined._1.id, joined._2.id, joined._1.sequence)
  }
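
For context, here's a simplified, self-contained version of the
surrounding setup (class and field names simplified, but "id" and
"sequence" are what the snippet above uses):

import org.apache.spark.sql.SparkSession

// Simplified record types -- the real classes have more fields.
case class ShortString(id: Long, sequence: String)
case class LongString(id: Long, sequence: String)
case class StringMatch(shortId: Long, longId: Long, sequence: String)

val spark = SparkSession.builder()
  .appName("substring-search")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tiny in-memory stand-ins just to keep the example runnable;
// the real datasets are ~150k short rows and ~300k long rows.
val shortStrings = Seq(ShortString(1L, "abc")).toDS()
val longStrings  = Seq(LongString(10L, "xxabcxx"), LongString(11L, "no match")).toDS()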


The dataset of short strings has about 150k rows, and the dataset of long
strings has about 300k. I assume Spark has to do a full cartesian product
because of the substring predicate -- that's roughly 150,000 x 300,000 = 45
billion substring checks, which would explain the CPU time. Is there a more
efficient way to do this?

I'd appreciate any pointers or suggestions. Thanks!
Haig