Posted to user@spark.apache.org by ca...@free.fr on 2022/02/07 04:09:48 UTC

TypeError: Can not infer schema for type:

>>> rdd = sc.parallelize([3,2,1,4])
>>> rdd.toDF().show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/opt/spark/python/pyspark/sql/session.py", line 675, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/opt/spark/python/pyspark/sql/session.py", line 698, in _create_dataframe
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/opt/spark/python/pyspark/sql/session.py", line 486, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "/opt/spark/python/pyspark/sql/session.py", line 466, in _inferSchema
    schema = _infer_schema(first, names=names)
  File "/opt/spark/python/pyspark/sql/types.py", line 1067, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>


Why does this fail in my PySpark? I don't understand the reason.
Thanks for any help.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: TypeError: Can not infer schema for type:

Posted by Mich Talebzadeh <mi...@gmail.com>.
Absolutely

The reason this error happens is that PySpark's schema inference expects each
RDD element to be row-like (a tuple, Row, or dict), not a bare scalar. In
other words, we have a List[int] but need a List[Tuple[int]].
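To make the row-like requirement concrete, here is a toy sketch (not Spark's
actual code, just an illustration of the kind of check behind the error):

```python
# Toy stand-in for the element check inside PySpark's _infer_schema:
# schema inference only understands row-like elements, so a bare int raises.
def looks_row_like(element):
    """Mimics the kind of element PySpark can infer a schema from."""
    return isinstance(element, (dict, tuple, list))

assert not looks_row_like(3)     # bare int -> TypeError in Spark
assert looks_row_like((3,))      # a 1-tuple is fine
```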


Try this


>>> rdd = sc.parallelize([3,2,1,4])

>>> df = rdd.map(lambda x: (x,)).toDF()
>>> df.printSchema()
root
 |-- _1: long (nullable = true)
>>> from pyspark.sql.functions import col
>>> df.filter((col("_1") > 2)).show()
+---+
| _1|
+---+
|  3|
|  4|
+---+

or create a dataframe with the schema defined

>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import StructType, StructField, IntegerType
>>> Schema = StructType([StructField("ID", IntegerType(), False)])
>>> df = spark.createDataFrame(sc.parallelize([3,2,1,4]).map(lambda x: (x,)), schema = Schema)
>>> df.filter(col("ID") > 2).show()
+---+
| ID|
+---+
|  3|
|  4|
+---+



On Mon, 7 Feb 2022 at 04:42, Sean Owen <sr...@gmail.com> wrote:

> You are passing a list of primitives. It expects something like a list of
> tuples, which can each have 1 int if you like.

Re: TypeError: Can not infer schema for type:

Posted by ca...@free.fr.
Thanks for the reply.

It seems strange that in the Scala shell I can do this directly:

scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+

But in PySpark I have to write:

>>> sc.parallelize([3,2,1,4]).map(lambda x: (x,1)).toDF(['id','count']).show()
+---+-----+
| id|count|
+---+-----+
|  3|    1|
|  2|    1|
|  1|    1|
|  4|    1|
+---+-----+


So the PySpark and Scala implementations differ here: in Scala, toDF works on
an RDD[Int] through the implicit encoders brought in by spark.implicits._,
while PySpark infers the schema at runtime and needs row-like elements.
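For completeness, a PySpark equivalent that matches the Scala output with a
single column (the live Spark calls are left as comments, since they assume
the shell's pre-existing `spark` session; the wrapping step is plain Python):

```python
# Wrap each int in a 1-tuple so PySpark can infer a single-column schema.
data = [3, 2, 1, 4]
rows = [(x,) for x in data]
assert rows == [(3,), (2,), (1,), (4,)]

# In a live PySpark shell (assumes the usual `spark` session), this yields
# the same single "value" column the Scala example shows:
#   spark.createDataFrame(rows, ["value"]).show()
```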

Thanks

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: TypeError: Can not infer schema for type:

Posted by Sean Owen <sr...@gmail.com>.
You are passing a list of primitives. It expects something like a list of
tuples, which can each have 1 int if you like.
