Posted to user@spark.apache.org by Eric Walker <er...@node.io> on 2015/08/12 00:20:33 UTC

adding a custom Scala RDD for use in PySpark

Hi,

I'm new to Scala, Spark and PySpark and have a question about what approach
to take for the problem I'm trying to solve.

I have noticed that working with HBase tables read in using
`newAPIHadoopRDD` can be quite slow with large data sets when one is
interested in only a small subset of the keyspace.  A prefix scan on the
underlying HBase table in this case takes 11 seconds, while a filter
applied to the full RDD returned by `newAPIHadoopRDD` takes 33 minutes.
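
For reference, the slow path is essentially a full read through
`newAPIHadoopRDD` followed by a key filter.  I did it from PySpark, but in
Scala terms it amounts to something like the sketch below (the table name
and prefix are made up, and `sc` is an existing SparkContext):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "events")  // illustrative table name

// Every row in the table is shipped through the RDD before the filter runs.
val fullRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

val subset = fullRdd.filter { case (key, _) =>
  Bytes.toString(key.get()).startsWith("user123|")  // illustrative prefix
}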

I looked around and found no way to specify a prefix scan from the Python
interface.  So I have written a Scala class that can be passed an argument,
constructs a scan object from it, calls `newAPIHadoopRDD` from Scala with
that scan object, and feeds the resulting RDD back to Python.
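
Roughly, the Scala side looks like the sketch below.  The package, class and
method names are made up for illustration rather than being my exact code:

package com.example.hbase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HBasePrefixScan {

  // Builds an RDD backed by a prefix scan instead of a full table scan.
  def byPrefix(sc: SparkContext, table: String, prefix: String): RDD[(String, String)] = {
    val scan = new Scan()
    scan.setStartRow(Bytes.toBytes(prefix))
    scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, table)
    // TableInputFormat picks up its scan from a base64-encoded string in the conf.
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
      // Plain strings are easier to hand back to Python than Hadoop writables.
      .map { case (key, result) =>
        (Bytes.toString(key.get()), Bytes.toString(result.value()))
      }
  }
}

From PySpark the object is then reached through the JVM gateway, which is
where the serialization question below comes in.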

It took a few twists and turns to get this to work.  A final challenge was
the fact that `org.apache.spark.api.python.SerDeUtil` is private.  This
suggests to me that I'm doing something wrong, although I got it to work
with sufficient hackery.
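
For illustration, the kind of hackery I mean is a thin helper compiled into
the `org.apache.spark.api.python` package so that it can see the private
object.  In the sketch below the names are hypothetical, and the exact
`pairRDDToPython` signature and batch-size handling may differ between Spark
versions:

package org.apache.spark.api.python

import org.apache.spark.rdd.RDD

// Lives in Spark's package namespace purely so it can call the private
// SerDeUtil object.
object HBasePythonBridge {

  // Pickles a pair RDD into the byte format PySpark's deserializer expects;
  // the default batch size here is an assumption, not a recommendation.
  def toPython(rdd: RDD[(String, String)], batchSize: Int = 10): RDD[Array[Byte]] =
    SerDeUtil.pairRDDToPython(rdd.map { case (k, v) => (k: Any, v: Any) }, batchSize)
}

The RDD of pickled byte arrays is what then gets handed back to the Python
side.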

What do people recommend as a general approach to getting PySpark RDDs
from HBase prefix scans?  I hope I have not missed something obvious.

Eric