Posted to user@spark.apache.org by Eric Walker <er...@node.io> on 2015/08/17 21:53:22 UTC

registering an empty RDD as a temp table in a PySpark SQL context

I have an RDD queried from a scan of a data source.  Sometimes the RDD has
rows and at other times it has none.  I would like to register this RDD as
a temporary table in a SQL context.  I suspect this will work in Scala, but
in PySpark some code assumes that the RDD has rows in it, which are used to
verify the schema:

https://github.com/apache/spark/blob/branch-1.3/python/pyspark/sql/context.py#L299
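
For concreteness, here is a minimal sketch of the kind of call that runs into this (the column names and types are placeholders, not my real schema):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[1]", "empty-scan-example")
    sqlContext = SQLContext(sc)

    # Stand-in for a scan that happened to return no rows.
    empty_rdd = sc.parallelize([], 1)

    # With no schema supplied, createDataFrame has to pull rows out of the
    # RDD to work out (and verify) a schema, which is where an empty RDD
    # becomes a problem.
    df = sqlContext.createDataFrame(empty_rdd)
    df.registerTempTable("scan_results")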

Before I attempt to extend the Scala code to handle an empty RDD or provide
an empty DataFrame that can be registered, I was wondering what people
recommend in this case.  Perhaps there's a simple way of registering an
empty RDD as a temporary table in a PySpark SQL context that I'm
overlooking.

An alternative is to add special-case logic in the client code to deal with
an RDD backed by an empty table scan.  But since the SQL queries already
handle the empty case, I was hoping to avoid special-case logic.

Eric

Re: registering an empty RDD as a temp table in a PySpark SQL context

Posted by Hemant Bhanawat <he...@gmail.com>.
That is definitely not a problem for Spark SQL. A temporary table (much like
a DataFrame) is just a logical plan with a name, and it is not evaluated
unless a query is fired against it.

I am not sure that using rdd.take in the Python code to verify the schema is
the right approach, as it triggers a Spark job.

BTW, why would you want to update the Spark code? The rdd.take call in the
Python code is the problem; all you want is to avoid the schema verification
in createDataFrame. I do not see any issue on the Spark side in the way it
handles an RDD that has no data.
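
For example, something like the following sketch should work: pass an
explicit StructType so that nothing has to be read from the RDD to infer a
schema (this assumes the 1.3-era Python API, with placeholder column names):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    sc = SparkContext("local[1]", "empty-rdd-temp-table")
    sqlContext = SQLContext(sc)

    # Placeholder schema; use whatever columns the scan would have produced.
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ])

    empty_rdd = sc.parallelize([], 1)  # stand-in for a scan that returned no rows

    # With the schema given explicitly, no rows are needed for inference.
    df = sqlContext.createDataFrame(empty_rdd, schema)
    df.registerTempTable("scan_results")

    # The temp table is just a named logical plan; querying it returns no rows.
    print(sqlContext.sql("SELECT COUNT(*) FROM scan_results").collect())

If the take-based verification still runs on that code path, it should simply
see no rows and move on.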


On Tue, Aug 18, 2015 at 1:23 AM, Eric Walker <er...@node.io> wrote:

> I have an RDD queried from a scan of a data source.  Sometimes the RDD has
> rows and at other times it has none.  I would like to register this RDD as
> a temporary table in a SQL context.  I suspect this will work in Scala, but
> in PySpark some code assumes that the RDD has rows in it, which are used to
> verify the schema:
>
>
> https://github.com/apache/spark/blob/branch-1.3/python/pyspark/sql/context.py#L299
>
> Before I attempt to extend the Scala code to handle an empty RDD or
> provide an empty DataFrame that can be registered, I was wondering what
> people recommend in this case.  Perhaps there's a simple way of registering
> an empty RDD as a temporary table in a PySpark SQL context that I'm
> overlooking.
>
> An alternative is to add special case logic in the client code to deal
> with an RDD backed by an empty table scan.  But since the SQL will already
> handle that, I was hoping to avoid special case logic.
>
> Eric
>
>