Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2015/01/13 06:45:34 UTC
[jira] [Resolved] (SPARK-5138) pyspark unable to infer schema of namedtuple
[ https://issues.apache.org/jira/browse/SPARK-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-5138.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0
Issue resolved by pull request 3978
[https://github.com/apache/spark/pull/3978]
> pyspark unable to infer schema of namedtuple
> --------------------------------------------
>
> Key: SPARK-5138
> URL: https://issues.apache.org/jira/browse/SPARK-5138
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.2.0
> Reporter: Gabe Mulley
> Priority: Trivial
> Fix For: 1.3.0
>
>
> When inferring the schema of an RDD that contains namedtuples, pyspark fails to recognize the records as namedtuples and raises an error.
> Example:
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from collections import namedtuple
> import os
> sc = SparkContext()
> rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
> TextLine = namedtuple('TextLine', 'line length')
> tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
> tuple_rdd.take(5) # This works
> sqlc = SQLContext(sc)
> # The following line raises an error
> schema_rdd = sqlc.inferSchema(tuple_rdd)
> {noformat}
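For reference, a namedtuple instance is an ordinary tuple subclass that also carries a `_fields` attribute, so one way to tell it apart from a plain tuple is to check for that attribute. The sketch below is illustrative only and needs no Spark: `is_namedtuple` and `to_field_dict` are hypothetical helpers, not PySpark functions, and this is not necessarily how pull request 3978 addresses the problem.

```python
from collections import namedtuple

TextLine = namedtuple('TextLine', 'line length')


def is_namedtuple(obj):
    """Hypothetical helper: True for namedtuple instances, False for plain tuples."""
    return isinstance(obj, tuple) and hasattr(obj, '_fields')


def to_field_dict(obj):
    """Hypothetical helper: map each namedtuple field name to its value."""
    return dict(zip(obj._fields, obj))


record = TextLine(line='# Apache Spark', length=14)

print(is_namedtuple(record))                   # namedtuple instance
print(is_namedtuple(('# Apache Spark', 14)))   # plain tuple
print(to_field_dict(record))                   # field names paired with values
```

The field names recovered this way are what a schema would need in order to name its columns.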
> The error raised is:
> {noformat}
> File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
>   process()
> File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
>   serializer.dump_stream(func(split_index, iterator), outfile)
> File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
>   vs = list(itertools.islice(iterator, batch))
> File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
>   yield next(iterator)
> File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
>   raise ValueError("unexpected tuple: %s" % obj)
> TypeError: not all arguments converted during string formatting
> {noformat}
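The `TypeError` on the last line of the traceback appears to come from the error-reporting statement itself rather than from schema inference: when the right-hand operand of `%` is a tuple, Python treats its elements as separate format arguments, so a two-field record does not fit a single `%s`. The sketch below reproduces that in plain Python with no Spark required. `record` is a stand-in value, and wrapping the object in a one-element tuple is the usual way to format it as a whole.

```python
from collections import namedtuple

TextLine = namedtuple('TextLine', 'line length')
record = TextLine(line='# Apache Spark', length=14)

# A tuple on the right of % is unpacked into separate format arguments,
# so two fields cannot fill one %s placeholder.
try:
    message = "unexpected tuple: %s" % record
except TypeError as exc:
    message = None
    print("formatting failed:", exc)

# Wrapping the object in a one-element tuple formats it as a single value.
safe_message = "unexpected tuple: %s" % (record,)
print(safe_message)
```

This would explain why the traceback ends in a `TypeError` even though the statement shown is trying to raise a `ValueError`.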
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)