Posted to user@spark.apache.org by "颜发才 (Yan Facai)" <ya...@gmail.com> on 2016/10/20 08:34:03 UTC

How to iterate the element of an array in DataFrame?

Hi, I want to extract the attribute `weight` from each element of an array,
and combine them to construct a sparse vector.

### My data is like this:

scala> mblog_tags.printSchema
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- weight: string (nullable = true)


scala> mblog_tags.show(false)
+--------------------------------------------------------------+
|category.firstCategory                                        |
+--------------------------------------------------------------+
|[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]              |
|[[tagCategory_029, 0.9]]                                      |
|[[tagCategory_029, 0.8]]                                      |
+--------------------------------------------------------------+


### And expected:
Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
Vectors.sparse(100, Array(29),  Array(0.9))
Vectors.sparse(100, Array(29),  Array(0.8))

How can I iterate over an array in a DataFrame?
Thanks.
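One possible sketch of the whole transformation (hedged: it assumes the vector index is the numeric suffix of `category`, a fixed dimension of 100, and a SparkSession named `spark`; `mblog_tags` is the DataFrame shown above, and none of these choices are confirmed by the thread):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Spark hands each element of an array<struct<...>> column to a UDF as a
// generic Row, so the fields are read by name rather than via a case class.
val tag2vec = udf { tags: Seq[Row] =>
  val indices = tags.map(_.getAs[String]("category").split("_").last.toInt)
  val values  = tags.map(_.getAs[String]("weight").toDouble)
  // SparseVector expects indices in increasing order, so sort the pairs.
  val (si, sv) = indices.zip(values).sortBy(_._1).unzip
  Vectors.sparse(100, si.toArray, sv.toArray): Vector
}

// The dotted column name needs backticks so it is not parsed as a field path.
val withVec = mblog_tags.withColumn("vec", tag2vec(col("`category.firstCategory`")))
```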

Re: How to iterate the element of an array in DataFrame?

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
scala> mblog_tags.dtypes
res13: Array[(String, String)] =
Array((tags,ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true)))

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true))))

What is wrong with the udf function `testUDF`?
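A note for the archive: inside a `udf`, Spark deserializes each struct element as a generic `Row`, not as a user-defined case class, so declaring the parameter as `Seq[Tags]` compiles but fails at runtime with the ClassCastException quoted later in this thread. A sketch of the usual workaround (assuming the array column is named `tags`, as in the `dtypes` output above):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Accept Seq[Row] and read the struct fields by name instead of casting:
val testUDF = udf { s: Seq[Row] => s(0).getAs[String]("weight") }

mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)
```

Case classes do work with the typed Dataset API (`as[...]`), where an encoder performs the conversion; a plain `udf` does not apply encoders to its arguments.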





On Tue, Oct 25, 2016 at 10:41 AM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:

> Thanks, Cheng Lian.
>
> I try to use case class:
>
> scala> case class Tags (category: String, weight: String)
>
> scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
>
> testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
> UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(
> StructField(category,StringType,true), StructField(weight,StringType,
> true)),true))))
>
>
> but it raises an ClassCastException when run:
>
> scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)
>
> 16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID
> 4)
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> cannot be cast to $line58.$read$$iw$$iw$Tags
>         at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.
> apply(<console>:27)
>         at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.
> apply(<console>:27)
> ...
>
>
> Where did I do wrong?
>
>
>
>
> On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian <li...@databricks.com> wrote:
>
>> You may either use SQL function "array" and "named_struct" or define a
>> case class with expected field names.
>>
>> Cheng
>>
>> On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:
>>
>> My expectation is:
>> root
>> |-- tag: vector
>>
>> namely, I want to extract from:
>> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>> to:
>> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>>
>> I believe it needs two step:
>> 1. val tag2vec = {tag: Array[Structure] => Vector}
>> 2. mblog_tags.withColumn("vec", tag2vec(col("tag"))
>>
>> But, I have no idea of how to describe the Array[Structure] in the
>> DataFrame.
>>
>>
>>
>>
>>
>> On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk...@163.com> wrote:
>>
>>> how about change Schema from
>>> root
>>>  |-- category.firstCategory: array (nullable = true)
>>>  |    |-- element: struct (containsNull = true)
>>>  |    |    |-- category: string (nullable = true)
>>>  |    |    |-- weight: string (nullable = true)
>>> to:
>>>
>>> root
>>>  |-- category: string (nullable = true)
>>>  |-- weight: string (nullable = true)
>>>
>>> 2016-10-21
>>> ------------------------------
>>> lk_spark
>>> ------------------------------
>>>
>>> *From:* 颜发才(Yan Facai) <ya...@gmail.com>
>>> *Sent:* 2016-10-21 15:35
>>> *Subject:* Re: How to iterate the element of an array in DataFrame?
>>> *To:* "user.spark" <us...@spark.apache.org>
>>> *Cc:*
>>>
>>> I don't know how to construct `array<struct<category:string,
>>> weight:string>>`.
>>> Could anyone help me?
>>>
>>> I try to get the array by :
>>> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>>>
>>> while the result is:
>>> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
>>> array<struct<_1:string,_2:string>>]
>>>
>>>
>>> How to express `struct<string, string>` ?
>>>
>>>
>>>
>>> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <ya...@gmail.com>
>>> wrote:
>>>
>>>> Hi, I want to extract the attribute `weight` of an array, and combine
>>>> them to construct a sparse vector.
>>>>
>>>> ### My data is like this:
>>>>
>>>> scala> mblog_tags.printSchema
>>>> root
>>>>  |-- category.firstCategory: array (nullable = true)
>>>>  |    |-- element: struct (containsNull = true)
>>>>  |    |    |-- category: string (nullable = true)
>>>>  |    |    |-- weight: string (nullable = true)
>>>>
>>>>
>>>> scala> mblog_tags.show(false)
>>>> +--------------------------------------------------------------+
>>>> |category.firstCategory                                        |
>>>> +--------------------------------------------------------------+
>>>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>>>> |[[tagCategory_029, 0.9]]                                      |
>>>> |[[tagCategory_029, 0.8]]                                      |
>>>> +--------------------------------------------------------------+
>>>>
>>>>
>>>> ### And expected:
>>>> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>>>> Vectors.sparse(100, Array(29),  Array(0.9))
>>>> Vectors.sparse(100, Array(29),  Array(0.8))
>>>>
>>>> How to iterate an array in DataFrame?
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: How to iterate the element of an array in DataFrame?

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
Thanks, Cheng Lian.

I tried to use a case class:

scala> case class Tags (category: String, weight: String)

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }

testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true))))


but it raises a ClassCastException when run:

scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)

16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to $line58.$read$$iw$$iw$Tags
        at
$line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
        at
$line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
...


Where did I go wrong?




On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian <li...@databricks.com> wrote:

> You may either use SQL function "array" and "named_struct" or define a
> case class with expected field names.
>
> Cheng
>
> On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:
>
> My expectation is:
> root
> |-- tag: vector
>
> namely, I want to extract from:
> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
> to:
> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>
> I believe it needs two step:
> 1. val tag2vec = {tag: Array[Structure] => Vector}
> 2. mblog_tags.withColumn("vec", tag2vec(col("tag"))
>
> But, I have no idea of how to describe the Array[Structure] in the
> DataFrame.
>
>
>
>
>
> On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk...@163.com> wrote:
>
>> how about change Schema from
>> root
>>  |-- category.firstCategory: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- category: string (nullable = true)
>>  |    |    |-- weight: string (nullable = true)
>> to:
>>
>> root
>>  |-- category: string (nullable = true)
>>  |-- weight: string (nullable = true)
>>
>> 2016-10-21
>> ------------------------------
>> lk_spark
>> ------------------------------
>>
>> *From:* 颜发才(Yan Facai) <ya...@gmail.com>
>> *Sent:* 2016-10-21 15:35
>> *Subject:* Re: How to iterate the element of an array in DataFrame?
>> *To:* "user.spark" <us...@spark.apache.org>
>> *Cc:*
>>
>> I don't know how to construct `array<struct<category:string,
>> weight:string>>`.
>> Could anyone help me?
>>
>> I try to get the array by :
>> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>>
>> while the result is:
>> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
>> array<struct<_1:string,_2:string>>]
>>
>>
>> How to express `struct<string, string>` ?
>>
>>
>>
>> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:
>>
>>> Hi, I want to extract the attribute `weight` of an array, and combine
>>> them to construct a sparse vector.
>>>
>>> ### My data is like this:
>>>
>>> scala> mblog_tags.printSchema
>>> root
>>>  |-- category.firstCategory: array (nullable = true)
>>>  |    |-- element: struct (containsNull = true)
>>>  |    |    |-- category: string (nullable = true)
>>>  |    |    |-- weight: string (nullable = true)
>>>
>>>
>>> scala> mblog_tags.show(false)
>>> +--------------------------------------------------------------+
>>> |category.firstCategory                                        |
>>> +--------------------------------------------------------------+
>>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>>> |[[tagCategory_029, 0.9]]                                      |
>>> |[[tagCategory_029, 0.8]]                                      |
>>> +--------------------------------------------------------------+
>>>
>>>
>>> ### And expected:
>>> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>>> Vectors.sparse(100, Array(29),  Array(0.9))
>>> Vectors.sparse(100, Array(29),  Array(0.8))
>>>
>>> How to iterate an array in DataFrame?
>>> Thanks.
>>>
>>>
>>>
>>>
>>
>
>

Re: How to iterate the element of an array in DataFrame?

Posted by Cheng Lian <li...@databricks.com>.
You may either use SQL function "array" and "named_struct" or define a 
case class with expected field names.

Cheng
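A sketch of the case-class route suggested here (hedged: it assumes Spark 2.x, a SparkSession named `spark`, and that the array column is the only one selected; the case-class field names must match the struct fields exactly):

```scala
import spark.implicits._  // brings the encoders needed by .as[...] into scope
import org.apache.spark.sql.functions.col

// Field names line up with the struct fields (category, weight):
case class Tag(category: String, weight: String)

// On a single-column DataFrame, .as[Seq[Tag]] lets the encoder convert each
// struct element into a Tag, so no manual Row handling is needed:
val tagsDS = mblog_tags
  .select(col("`category.firstCategory`").as("tags"))
  .as[Seq[Tag]]

tagsDS.map(tags => tags.map(_.weight)).show(false)
```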


On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:
> My expectation is:
> root
> |-- tag: vector
>
> namely, I want to extract from:
> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
> to:
> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>
> I believe it needs two step:
> 1. val tag2vec = {tag: Array[Structure] => Vector}
> 2. mblog_tags.withColumn("vec", tag2vec(col("tag"))
>
> But, I have no idea of how to describe the Array[Structure] in the 
> DataFrame.
>
>
>
>
>
> On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk_spark@163.com 
> <ma...@163.com>> wrote:
>
>     how about change Schema from
>     root
>      |-- category.firstCategory: array (nullable = true)
>      |    |-- element: struct (containsNull = true)
>      |    |    |-- category: string (nullable = true)
>      |    |    |-- weight: string (nullable = true)
>     to:
>     root
>      |-- category: string (nullable = true)
>      |-- weight: string (nullable = true)
>     2016-10-21
>     ------------------------------------------------------------------------
>     lk_spark
>     ------------------------------------------------------------------------
>
>         *From:* 颜发才(Yan Facai) <yafc18@gmail.com
>         <ma...@gmail.com>>
>         *Sent:* 2016-10-21 15:35
>         *Subject:* Re: How to iterate the element of an array in DataFrame?
>         *To:* "user.spark" <user@spark.apache.org
>         <ma...@spark.apache.org>>
>         *Cc:*
>         I don't know how to construct
>         `array<struct<category:string,weight:string>>`.
>         Could anyone help me?
>
>         I try to get the array by :
>         scala> mblog_tags.map(_.getSeq[(String, String)](0))
>
>         while the result is:
>         res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] =
>         [value: array<struct<_1:string,_2:string>>]
>
>
>         How to express `struct<string, string>` ?
>
>
>
>         On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai)
>         <yafc18@gmail.com <ma...@gmail.com>> wrote:
>
>             Hi, I want to extract the attribute `weight` of an array,
>             and combine them to construct a sparse vector.
>
>             ### My data is like this:
>
>             scala> mblog_tags.printSchema
>             root
>              |-- category.firstCategory: array (nullable = true)
>              |    |-- element: struct (containsNull = true)
>              |    |    |-- category: string (nullable = true)
>              |    |    |-- weight: string (nullable = true)
>
>
>             scala> mblog_tags.show(false)
>             +--------------------------------------------------------------+
>             |category.firstCategory |
>             +--------------------------------------------------------------+
>             |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>             |[[tagCategory_029, 0.9]]         |
>             |[[tagCategory_029, 0.8]]        |
>             +--------------------------------------------------------------+
>
>
>             ### And expected:
>             Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>             Vectors.sparse(100, Array(29), Array(0.9))
>             Vectors.sparse(100, Array(29), Array(0.8))
>
>             How to iterate an array in DataFrame?
>             Thanks.
>
>
>
>
>


Re: Re: How to iterate the element of an array in DataFrame?

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
My expectation is:
root
|-- tag: vector

namely, I want to extract from:
[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]
to:
Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))

I believe it needs two steps:
1. val tag2vec = {tag: Array[Structure] => Vector}
2. mblog_tags.withColumn("vec", tag2vec(col("tag")))

But I have no idea how to describe the Array[Structure] in the
DataFrame.





On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk...@163.com> wrote:

> how about change Schema from
> root
>  |-- category.firstCategory: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- category: string (nullable = true)
>  |    |    |-- weight: string (nullable = true)
> to:
>
> root
>  |-- category: string (nullable = true)
>  |-- weight: string (nullable = true)
>
> 2016-10-21
> ------------------------------
> lk_spark
> ------------------------------
>
> *From:* 颜发才(Yan Facai) <ya...@gmail.com>
> *Sent:* 2016-10-21 15:35
> *Subject:* Re: How to iterate the element of an array in DataFrame?
> *To:* "user.spark" <us...@spark.apache.org>
> *Cc:*
>
> I don't know how to construct `array<struct<category:string,
> weight:string>>`.
> Could anyone help me?
>
> I try to get the array by :
> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>
> while the result is:
> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
> array<struct<_1:string,_2:string>>]
>
>
> How to express `struct<string, string>` ?
>
>
>
> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:
>
>> Hi, I want to extract the attribute `weight` of an array, and combine
>> them to construct a sparse vector.
>>
>> ### My data is like this:
>>
>> scala> mblog_tags.printSchema
>> root
>>  |-- category.firstCategory: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- category: string (nullable = true)
>>  |    |    |-- weight: string (nullable = true)
>>
>>
>> scala> mblog_tags.show(false)
>> +--------------------------------------------------------------+
>> |category.firstCategory                                        |
>> +--------------------------------------------------------------+
>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>> |[[tagCategory_029, 0.9]]                                      |
>> |[[tagCategory_029, 0.8]]                                      |
>> +--------------------------------------------------------------+
>>
>>
>> ### And expected:
>> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>> Vectors.sparse(100, Array(29),  Array(0.9))
>> Vectors.sparse(100, Array(29),  Array(0.8))
>>
>> How to iterate an array in DataFrame?
>> Thanks.
>>
>>
>>
>>
>

Re: Re: How to iterate the element of an array in DataFrame?

Posted by lk_spark <lk...@163.com>.
How about changing the schema from
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- weight: string (nullable = true)

to:

root
 |-- category: string (nullable = true)
 |-- weight: string (nullable = true)

2016-10-21 

lk_spark 



From: 颜发才(Yan Facai) <ya...@gmail.com>
Sent: 2016-10-21 15:35
Subject: Re: How to iterate the element of an array in DataFrame?
To: "user.spark" <us...@spark.apache.org>
Cc:

I don't know how to construct `array<struct<category:string,weight:string>>`.
Could anyone help me?


I try to get the array by :
scala> mblog_tags.map(_.getSeq[(String, String)](0))

while the result is:
res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value: array<struct<_1:string,_2:string>>]




How to express `struct<string, string>` ?






On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:

Hi, I want to extract the attribute `weight` of an array, and combine them to construct a sparse vector. 



### My data is like this:

scala> mblog_tags.printSchema
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- weight: string (nullable = true)


scala> mblog_tags.show(false)
+--------------------------------------------------------------+
|category.firstCategory                                        |
+--------------------------------------------------------------+
|[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
|[[tagCategory_029, 0.9]]                                      |
|[[tagCategory_029, 0.8]]                                      |
+--------------------------------------------------------------+



### And expected:
Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
Vectors.sparse(100, Array(29),  Array(0.9))
Vectors.sparse(100, Array(29),  Array(0.8))


How to iterate an array in DataFrame?

Thanks.

Re: How to iterate the element of an array in DataFrame?

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
I don't know how to construct
`array<struct<category:string,weight:string>>`.
Could anyone help me?

I try to get the array by:
scala> mblog_tags.map(_.getSeq[(String, String)](0))

while the result is:
res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
array<struct<_1:string,_2:string>>]


How to express `struct<string, string>`?
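For the record: the type parameter of `getSeq[(String, String)]` is an unchecked cast, so at runtime the elements are still `Row`s; the `array<struct<_1:string,_2:string>>` schema above comes from the encoder of the declared result type, not from any actual conversion. A sketch of converting the elements explicitly (assuming the array is column 0 and a SparkSession named `spark`):

```scala
import spark.implicits._  // encoder for Seq[(String, String)]
import org.apache.spark.sql.Row

// Read the elements as Rows and build the tuples by hand:
val pairs = mblog_tags.map { row =>
  row.getSeq[Row](0).map(r => (r.getString(0), r.getString(1)))
}
// pairs: Dataset[Seq[(String, String)]]
```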



On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:

> Hi, I want to extract the attribute `weight` of an array, and combine them
> to construct a sparse vector.
>
> ### My data is like this:
>
> scala> mblog_tags.printSchema
> root
>  |-- category.firstCategory: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- category: string (nullable = true)
>  |    |    |-- weight: string (nullable = true)
>
>
> scala> mblog_tags.show(false)
> +--------------------------------------------------------------+
> |category.firstCategory                                        |
> +--------------------------------------------------------------+
> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
> |[[tagCategory_029, 0.9]]                                      |
> |[[tagCategory_029, 0.8]]                                      |
> +--------------------------------------------------------------+
>
>
> ### And expected:
> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
> Vectors.sparse(100, Array(29),  Array(0.9))
> Vectors.sparse(100, Array(29),  Array(0.8))
>
> How to iterate an array in DataFrame?
> Thanks.
>
>
>
>