Posted to user@spark.apache.org by Linh Tran <li...@apple.com> on 2016/07/13 23:43:18 UTC

Doing record linkage using string comparators in Spark

Hi guys,

I'm hoping that someone can help me make my setup more efficient. I'm trying to do record linkage across 2.5 billion records and have set myself up in Spark to handle the data. Right now I'm relying on R (with the stringdist and RecordLinkage packages) to do the actual linkage: I transfer small batches from Spark to R, do the linkage there, and send the resulting IDs linking the records back to Spark. What I'd like is to set up Spark so that the string distance measures I'm using in R can be computed directly in Spark, avoiding the data transfers. How can I go about doing this? An example of my data is below, followed by a sketch of the kind of query I'm imagining and the R code I'm using today (n.b. "myMatchingFunction" calls functions from the stringdist and RecordLinkage packages). I'm open to switching to Python or Scala if I need to, or to incorporating the C++ code for the string comparators into Spark. Thanks. -- Linh

FIRSTNAME  LASTNAME  EMAIL                 PHONE    ADDRESS           DESIRED_ID
John       Smith     johnsmith@domain.com           1234 Main St.     1
John       Smith                           1234567                    1
J          Smith     johnsmith@domain.com  1234567  1234 Main Street  1
Jane       Smith                           2345678                    2
Jane       Smith     janesmith@domain.com  2345678  5678 1st Street   2
Jane       Smith                                    5678 First St.    2
Jane       Smith                                                      3
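
For concreteness, here's roughly the kind of thing I'm imagining. If I'm reading the docs right, Spark SQL has a built-in levenshtein function, so a self-join on a cheap blocking key could score candidate pairs without collecting anything into R. This is only a sketch: the "people" temp table name and the distance threshold are made up for illustration, and the column names match my example data above.

pairs <- sql(sqlContext, "
  SELECT a.EMAIL AS email_a, b.EMAIL AS email_b,
         levenshtein(a.FIRSTNAME, b.FIRSTNAME) AS fn_dist,
         levenshtein(a.ADDRESS, b.ADDRESS) AS addr_dist
  FROM people a
  JOIN people b
    ON a.LASTNAME = b.LASTNAME -- blocking key to limit candidate pairs
  WHERE levenshtein(a.FIRSTNAME, b.FIRSTNAME) <= 2")

For reference, the code I'm running today is below.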


# Load and cache the UK break-code table, and register it for SQL queries
uk_breakcodes <- read.df(sqlContext, "LT_IDMRQ_BREAK_CODES_UK.parquet", "parquet")
cache(uk_breakcodes)
registerTempTable(uk_breakcodes, "LT_IDMRQ_BREAK_CODES_UK")
# Collect one small batch (a single block of break codes) to the driver...
tmp <- collect(sql(sqlContext, "select * from LT_IDMRQ_BREAK_CODES_UK where ADDRBRKCD = 'XXXXXXX' or BADDRBRKCD = 'XXXXXXX' or OPBRKCD = 'XXXXXX'"))
# ...and run the stringdist/RecordLinkage matching locally in R
matched <- myMatchingFunction(tmp)
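
Alternatively, if the comparison logic has to stay in R: am I right that Spark 2.0's SparkR adds gapply, which groups a DataFrame by a column and runs an R function on each group's data.frame? If so, something like the rough sketch below might let myMatchingFunction run inside each block on the executors instead of on the driver (the output schema fields are made up for illustration, and I haven't tried this).

# Sketch only: block on an existing break-code column and run the R
# matching function once per block, distributed across the cluster
outSchema <- structType(structField("EMAIL", "string"),
                        structField("DESIRED_ID", "integer"))
matched <- gapply(uk_breakcodes,
                  "ADDRBRKCD", # blocking column from my data
                  function(key, block) {
                    myMatchingFunction(block) # must return a data.frame matching outSchema
                  },
                  outSchema)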