Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2015/01/13 06:45:34 UTC

[jira] [Resolved] (SPARK-5138) pyspark unable to infer schema of namedtuple

     [ https://issues.apache.org/jira/browse/SPARK-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-5138.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

Issue resolved by pull request 3978
[https://github.com/apache/spark/pull/3978]

> pyspark unable to infer schema of namedtuple
> --------------------------------------------
>
>                 Key: SPARK-5138
>                 URL: https://issues.apache.org/jira/browse/SPARK-5138
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.2.0
>            Reporter: Gabe Mulley
>            Priority: Trivial
>             Fix For: 1.3.0
>
>
> When attempting to infer the schema of an RDD that contains namedtuples, PySpark fails to identify the records as namedtuples and raises an error instead.
> Example:
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from collections import namedtuple
> import os
> sc = SparkContext()
> rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
> TextLine = namedtuple('TextLine', 'line length')
> tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
> tuple_rdd.take(5)  # This works
> sqlc = SQLContext(sc)
> # The following line raises an error
> schema_rdd = sqlc.inferSchema(tuple_rdd)
> {noformat}
> The error raised is:
> {noformat}
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
>     process()
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
>     yield next(iterator)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
>     raise ValueError("unexpected tuple: %s" % obj)
> TypeError: not all arguments converted during string formatting
> {noformat}
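> A note on the traceback: the code in convert_struct intends to raise a ValueError, but because obj is a namedtuple (a tuple with more than one element), the % string formatting itself fails first, so a TypeError surfaces instead of the intended message. A minimal illustration (not from the original report, shown only to explain the masked error):
> {noformat}
> # "%s" % obj treats a tuple as the full argument list, so a multi-element
> # tuple raises TypeError before the ValueError message is even built.
> obj = ('some line', 9)
> try:
>     raise ValueError("unexpected tuple: %s" % obj)
> except TypeError as e:
>     print(e)  # not all arguments converted during string formatting
> {noformat}
> The underlying schema-inference failure itself is what the pull request above addresses.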



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org