
Posted to issues@spark.apache.org by "Semet (JIRA)" <ji...@apache.org> on 2016/09/01 11:51:20 UTC

[jira] [Created] (SPARK-17360) PySpark can create dataframe from a Python generator

Semet created SPARK-17360:
-----------------------------

             Summary: PySpark can create dataframe from a Python generator
                 Key: SPARK-17360
                 URL: https://issues.apache.org/jira/browse/SPARK-17360
             Project: Spark
          Issue Type: Improvement
            Reporter: Semet
            Priority: Trivial


It looks like one can create a DataFrame from a Python generator, which might be more efficient than first building a list of rows and passing it to createDataFrame:

{code}
>>> # On Python 3, you want to use "range" on the following line
>>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 10000000))
>>> d  # Please note that 'd' is a generator and not a structure with the 10000000 elements.
<generator object <genexpr> at 0x7f1234b92af0>
>>> sqlContext.createDataFrame(d).take(5)
[Row(age=1, name=u'Alice-1')]
[Row(age=2, name=u'Alice-2')]
[Row(age=3, name=u'Alice-3')]
[Row(age=4, name=u'Alice-4')]
[Row(age=5, name=u'Alice-5')]
{code}
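
The difference between a generator and a fully built list, which the comment on 'd' points at, can be sketched in plain Python without Spark. This is only an illustration under assumptions: the names make_rows, rows_gen and rows_list are hypothetical, N is reduced to keep it quick, and it says nothing about whether createDataFrame consumes its input lazily or materializes it internally, which is what any real efficiency gain would depend on.

```python
import sys
from itertools import islice

N = 100000  # smaller than the 10000000 in the report, to keep the demo quick


def make_rows(n):
    """Return a generator of dict rows shaped like those in the report."""
    return ({'name': 'Alice-{}'.format(i), 'age': i} for i in range(n))


rows_gen = make_rows(N)         # nothing is built yet
rows_list = list(make_rows(N))  # all N dicts are built up front

# The generator object stays small regardless of N,
# while the list holds N references.
gen_size = sys.getsizeof(rows_gen)
list_size = sys.getsizeof(rows_list)

# Pull only the first five rows; the remaining rows are never built.
first_five = list(islice(rows_gen, 5))

# A generator is single-pass: the five rows consumed above are gone,
# so the next row is the sixth one.
next_row = next(rows_gen)
```

Because a generator can only be iterated once, a unit test covering this case would probably need to build a fresh generator for each call rather than reuse one.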

From a look at the code, nothing significant needs to change in the implementation; only the documentation and unit tests need updating.


