Posted to user@spark.apache.org by ca...@free.fr on 2022/02/06 11:50:59 UTC

dataframe doesn't support higher order func, right?

For example, this works for an RDD object:

scala> val li = List(3,2,1,4,0)
li: List[Int] = List(3, 2, 1, 4, 0)

scala> val rdd = sc.parallelize(li)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at 
parallelize at <console>:24

scala> rdd.filter(_ > 2).collect()
res0: Array[Int] = Array(3, 4)


After I convert the RDD to a DataFrame, the filter won't work:

scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.filter(_ > 2).show()
<console>:24: error: value > is not a member of org.apache.spark.sql.Row
        df.filter(_ > 2).show()


But this works:

scala> df.filter($"value" > 2).show()
+-----+
|value|
+-----+
|    3|
|    4|
+-----+


Where can I check all the methods supported by DataFrame?


Thank you.
Frakass


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: dataframe doesn't support higher order func, right?

Posted by Sean Owen <sr...@gmail.com>.
DataFrames are a quite different API, more SQL-like in their operations, not
functional. The equivalent would be more like df.filter("value > 2")
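
For illustration, a minimal sketch of the usual equivalents, assuming a
Spark 3.x spark-shell where spark.implicits._ is in scope (it is by default).
Since a DataFrame is just a Dataset[Row], the full method list is in the
org.apache.spark.sql.Dataset scaladoc.

// Build the same single-column DataFrame as in the question.
val df = sc.parallelize(List(3, 2, 1, 4, 0)).toDF("value")

// 1. SQL expression string, closest to the SQL-like style:
df.filter("value > 2").show()

// 2. Column expression, as in the original post:
df.filter($"value" > 2).show()

// 3. Typed view of the same data: this recovers the functional,
//    higher-order style of the RDD API:
df.as[Int].filter(_ > 2).show()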

On Sun, Feb 6, 2022 at 5:51 AM <ca...@free.fr> wrote:

> For example, this works for an RDD object:
>
> scala> val li = List(3,2,1,4,0)
> li: List[Int] = List(3, 2, 1, 4, 0)
>
> scala> val rdd = sc.parallelize(li)
> rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
> parallelize at <console>:24
>
> scala> rdd.filter(_ > 2).collect()
> res0: Array[Int] = Array(3, 4)
>
>
> After I convert the RDD to a DataFrame, the filter won't work:
>
> scala> val df = rdd.toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
>
> scala> df.filter(_ > 2).show()
> <console>:24: error: value > is not a member of org.apache.spark.sql.Row
>         df.filter(_ > 2).show()
>
>
> But this works:
>
> scala> df.filter($"value" > 2).show()
> +-----+
> |value|
> +-----+
> |    3|
> |    4|
> +-----+
>
>
> Where can I check all the methods supported by DataFrame?
>
>
> Thank you.
> Frakass
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: dataframe doesn't support higher order func, right?

Posted by Sean Owen <sr...@gmail.com>.
Scala and Python are not the same in this regard. This isn't related to how
Spark works.
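
For what it's worth, a minimal Scala sketch of why the direct conversion works
in the Scala shell, assuming spark.implicits._ is in scope. The PySpark
failure above is a schema-inference issue rather than a parentheses issue; the
usual workaround there is to map each value into a one-field tuple, or pass an
explicit schema, before building the DataFrame.

// spark.implicits._ supplies an Encoder[Int], so an RDD[Int] converts
// to a DataFrame directly; PySpark has no such implicit and cannot
// infer a schema from bare ints.
val rdd = sc.parallelize(List(3, 2, 1, 4))
rdd.toDF("value").show()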

On Sun, Feb 6, 2022, 10:04 PM <ca...@free.fr> wrote:

> Indeed. In spark-shell I always omit the parentheses:
>
> scala> sc.parallelize(List(3,2,1,4)).toDF.show
> +-----+
> |value|
> +-----+
> |    3|
> |    2|
> |    1|
> |    4|
> +-----+
>
> So I thought it would be OK in PySpark too.
>
> But this still doesn't work. Why?
>
> >>> sc.parallelize([3,2,1,4]).toDF().show()
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
>      return sparkSession.createDataFrame(self, schema, sampleRatio)
>    File "/opt/spark/python/pyspark/sql/session.py", line 675, in
> createDataFrame
>      return self._create_dataframe(data, schema, samplingRatio,
> verifySchema)
>    File "/opt/spark/python/pyspark/sql/session.py", line 698, in
> _create_dataframe
>      rdd, schema = self._createFromRDD(data.map(prepare), schema,
> samplingRatio)
>    File "/opt/spark/python/pyspark/sql/session.py", line 486, in
> _createFromRDD
>      struct = self._inferSchema(rdd, samplingRatio, names=schema)
>    File "/opt/spark/python/pyspark/sql/session.py", line 466, in
> _inferSchema
>      schema = _infer_schema(first, names=names)
>    File "/opt/spark/python/pyspark/sql/types.py", line 1067, in
> _infer_schema
>      raise TypeError("Can not infer schema for type: %s" % type(row))
> TypeError: Can not infer schema for type: <class 'int'>
>
>
> Spark 3.2.0
>
>
> On 07/02/2022 11:44, Sean Owen wrote:
> > This is just basic Python - you're missing the parentheses on toDF, so you
> > are not calling the function or getting its result.
> >
> > On Sun, Feb 6, 2022 at 9:39 PM <ca...@free.fr> wrote:
> >
> >> I am a bit confused why this doesn't work in PySpark:
> >>
> >>>>> x = sc.parallelize([3,2,1,4])
> >>>>> x.toDF.show()
> >> Traceback (most recent call last):
> >> File "<stdin>", line 1, in <module>
> >> AttributeError: 'function' object has no attribute 'show'
> >>
> >> Thank you.
> >>
> >>
> > ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

Re: dataframe doesn't support higher order func, right?

Posted by ca...@free.fr.
Indeed. In spark-shell I always omit the parentheses:

scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+

So I thought it would be OK in PySpark too.

But this still doesn't work. Why?

>>> sc.parallelize([3,2,1,4]).toDF().show()
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
     return sparkSession.createDataFrame(self, schema, sampleRatio)
   File "/opt/spark/python/pyspark/sql/session.py", line 675, in 
createDataFrame
     return self._create_dataframe(data, schema, samplingRatio, 
verifySchema)
   File "/opt/spark/python/pyspark/sql/session.py", line 698, in 
_create_dataframe
     rdd, schema = self._createFromRDD(data.map(prepare), schema, 
samplingRatio)
   File "/opt/spark/python/pyspark/sql/session.py", line 486, in 
_createFromRDD
     struct = self._inferSchema(rdd, samplingRatio, names=schema)
   File "/opt/spark/python/pyspark/sql/session.py", line 466, in 
_inferSchema
     schema = _infer_schema(first, names=names)
   File "/opt/spark/python/pyspark/sql/types.py", line 1067, in 
_infer_schema
     raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>


Spark 3.2.0


On 07/02/2022 11:44, Sean Owen wrote:
> This is just basic Python - you're missing the parentheses on toDF, so you
> are not calling the function or getting its result.
> 
> On Sun, Feb 6, 2022 at 9:39 PM <ca...@free.fr> wrote:
> 
>> I am a bit confused why this doesn't work in PySpark:
>> 
>>>>> x = sc.parallelize([3,2,1,4])
>>>>> x.toDF.show()
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> AttributeError: 'function' object has no attribute 'show'
>> 
>> Thank you.
>> 
>> 
> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: dataframe doesn't support higher order func, right?

Posted by Sean Owen <sr...@gmail.com>.
This is just basic Python - you're missing the parentheses on toDF, so you are
not calling the function or getting its result.
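
On the Scala side the distinction disappears, which is why the spark-shell
habit is misleading here; a minimal sketch, assuming a spark-shell session:

// Scala lets a no-argument method be called without parentheses, so both
// lines below invoke toDF; in Python, x.toDF without parentheses is only
// a reference to the method object, never a call.
val rdd = sc.parallelize(List(3, 2, 1, 4))
rdd.toDF.show()    // parentheses omitted: still a call in Scala
rdd.toDF().show()  // explicit call: same result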

On Sun, Feb 6, 2022 at 9:39 PM <ca...@free.fr> wrote:

> I am a bit confused why this doesn't work in PySpark:
>
> >>> x = sc.parallelize([3,2,1,4])
> >>> x.toDF.show()
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> AttributeError: 'function' object has no attribute 'show'
>
>
> Thank you.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: dataframe doesn't support higher order func, right?

Posted by ca...@free.fr.
I am a bit confused why this doesn't work in PySpark:

>>> x = sc.parallelize([3,2,1,4])
>>> x.toDF.show()
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'show'


Thank you.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: dataframe doesn't support higher order func, right?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Basically you are creating a DataFrame (a DataFrame is a *Dataset* organized
into named columns, conceptually equivalent to a table in a relational
database) out of an RDD here.


scala> val rdd = sc.parallelize(List(3, 2, 1, 4, 0))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at
parallelize at <console>:24

scala> // convert it to a dataframe

scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.filter('value > 2).show
+-----+
|value|
+-----+
|    3|
|    4|
+-----+
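
Since the subject line asks about higher-order functions: Spark SQL does ship
higher-order functions for array columns, and from Spark 3.0 they can be
called with Scala lambdas through org.apache.spark.sql.functions. A minimal
sketch, assuming a spark-shell session with spark.implicits._ in scope:

import org.apache.spark.sql.functions.filter

// Here filter is the SQL higher-order function over array elements,
// taking a Column-level lambda (Spark 3.0+).
val arrays = Seq(Seq(3, 2, 1, 4, 0)).toDF("xs")
arrays.select(filter($"xs", x => x > 2).alias("big")).show()
// +------+
// |   big|
// +------+
// |[3, 4]|
// +------+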

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 6 Feb 2022 at 11:51, <ca...@free.fr> wrote:

> For example, this works for an RDD object:
>
> scala> val li = List(3,2,1,4,0)
> li: List[Int] = List(3, 2, 1, 4, 0)
>
> scala> val rdd = sc.parallelize(li)
> rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
> parallelize at <console>:24
>
> scala> rdd.filter(_ > 2).collect()
> res0: Array[Int] = Array(3, 4)
>
>
> After I convert the RDD to a DataFrame, the filter won't work:
>
> scala> val df = rdd.toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
>
> scala> df.filter(_ > 2).show()
> <console>:24: error: value > is not a member of org.apache.spark.sql.Row
>         df.filter(_ > 2).show()
>
>
> But this works:
>
> scala> df.filter($"value" > 2).show()
> +-----+
> |value|
> +-----+
> |    3|
> |    4|
> +-----+
>
>
> Where can I check all the methods supported by DataFrame?
>
>
> Thank you.
> Frakass
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>