Posted to dev@spark.apache.org by Olivier Girardot <o....@lateral-thoughts.com> on 2016/09/13 15:08:47 UTC

Spark SQL - Applying transformation on a struct inside an array

Hi everyone,
I'm currently trying to create a generic transformation mechanism on a Dataframe
to modify an arbitrary column regardless of the underlying schema.
It's "relatively" straightforward for complex types like struct<struct<…>>: apply
an arbitrary UDF on the column and replace the data "inside" the struct. However,
I'm struggling to make it work for complex types containing arrays along the way,
like struct<array<struct<…>>>.
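For illustration, here is a minimal sketch of the struct<struct<…>> case that
does work, rebuilding the enclosing structs around the transformed leaf. The toy
schema (a.b.c) and the upper-casing UDF are illustrative assumptions, not taken
from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, udf}

object NestedStructDemo extends App {
  val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()

  val df = spark.read.json(spark.sparkContext.parallelize(List(
    """{"a":{"b":{"c":"toto"}}}""")))
  val upperUdf = udf((s: String) => s.toUpperCase)

  // Rebuild the structs around the transformed leaf; the aliases restore the
  // original field names so the schema shape is preserved.
  val result = df.withColumn("a",
    struct(struct(upperUdf(df("a.b.c")).as("c")).as("b")))
  result.show() // a.b.c is now "TOTO"
}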
Michael Armbrust seemed to allude on the mailing list/forum to a way of using
Encoders to do that. I'd be interested in any pointers, especially considering
that it's not possible to output any Row or GenericRowWithSchema from a UDF
(thanks to
https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657
it seems).
To sum up, I'd like to find a way to apply a transformation on complex nested
datatypes (arrays and structs) on a Dataframe, updating the value itself.
Regards,
Olivier Girardot

Re: Spark SQL - Applying transformation on a struct inside an array

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
So it seems the only way I've found for now is a recursive handling of the Row
instances directly, but to do that I have to go back to RDDs. I've put together
a simple test case demonstrating the problem:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{FlatSpec, Matchers}

class DFInPlaceTransform extends FlatSpec with Matchers {
  val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()

  it should "access and mutate deeply nested arrays/structs" in {
    val df = spark.read.json(spark.sparkContext.parallelize(List(
      """{"a":[{"b" : "toto" }]}""")))
    df.show()
    df.printSchema()

    val result = transformInPlace("a.b", df)

    result.printSchema()
    result.show()

    result.schema should be (df.schema)
    val res = result.toJSON.take(1).head
    res should be("""{"a":[{"b":"TOTO"}]}""")
  }

  def transformInPlace(path: String, df: DataFrame): DataFrame = {
    val udf = spark.udf.register("transform", (s: String) => s.toUpperCase)
    val root = path.split('.').head
    df.withColumn(root, udf(df(path))) // does not work of course
  }
}
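For what it's worth, here is a hedged sketch of that RDD round-trip for this
exact test schema (struct<a:array<struct<b:string>>>); it is hardcoded to the
schema rather than generic, which is exactly the problem:

  import org.apache.spark.sql.Row

  def transformViaRdd(df: DataFrame): DataFrame = {
    val rdd = df.rdd.map { row =>
      // Rewrite each element of the array "a", upper-casing its "b" field.
      val as = row.getSeq[Row](row.fieldIndex("a")).map { inner =>
        Row(inner.getString(inner.fieldIndex("b")).toUpperCase)
      }
      Row(as)
    }
    df.sparkSession.createDataFrame(rdd, df.schema) // re-apply the original schema
  }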

The three other solutions I see are:
 * to create a dedicated Expression for in-place modifications of nested arrays
   and structs,
 * to use heavy explode/lateral view/group by computations (sketched below), but
   that's bound to be inefficient,
 * or to generate bytecode using the schema to do the nested "getRow, getSeq…"
   calls and re-create the rows once the transformation is applied.
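A rough sketch of the explode/group by option from the list above, again assuming
the toy test schema. Note that collect_list over struct values and the id-based
regroup make this fragile and expensive, and rows whose array is empty are
silently dropped:

  import org.apache.spark.sql.functions._

  def transformViaExplode(df: DataFrame): DataFrame = {
    val upperUdf = udf((s: String) => s.toUpperCase)
    df.withColumn("_id", monotonically_increasing_id())
      .select(col("_id"), explode(col("a")).as("a_elem"))
      .withColumn("a_elem", struct(upperUdf(col("a_elem.b")).as("b")))
      .groupBy("_id")
      .agg(collect_list(col("a_elem")).as("a"))
      .drop("_id")
  }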

I'd like to open an issue regarding that use case, because it's not the first or
last time it comes up and I still don't see any generic solution using Dataframes.
Thanks for your time,
Regards,
Olivier
 





On Fri, Sep 16, 2016 10:19 AM, Olivier Girardot o.girardot@lateral-thoughts.com
wrote:
Hi Michael,
Well, for nested structs, I saw in the tests the behaviour defined by SPARK-12512
for the "a.b.c" handling in withColumn, and even if it's not ideal for me, I
managed to make it work anyway like that:
> df.withColumn("a", struct(struct(myUDF(df("a.b.c"))))) // I didn't put back the aliases but you see what I mean
What I'd like to make work, in essence, is something like this:
> val someFunc: String => String = ???
> val myUDF = udf(someFunc)
> df.withColumn("a.b[*].c", myUDF(df("a.b[*].c")))
The fact is that, in order to be consistent with the previous API, maybe I'd have
to accept something like struct(array(struct(…, which would be troublesome because
I'd have to parse the arbitrary input string and create something like
"a.b[*].c" => struct(array(struct(…
I realise the ambiguity implied by this kind of column expression, but for now
there doesn't seem to be a clean way to update data in place at an arbitrary depth.
I'll try to work on a PR that would make this possible, but any pointers would
be appreciated.
Regards,
Olivier.
 





On Fri, Sep 16, 2016 12:42 AM, Michael Armbrust michael@databricks.com
wrote:
Is what you are looking for a withColumn that supports in-place modification of
nested columns? Or is it some other problem?
On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girardot@lateral-thoughts.com>  wrote:
I tried to use the RowEncoder but got stuck along the way.
The main issue really is that even if it's possible (however tedious) to pattern
match generically over Row(s) and target the nested field that you need to modify,
Rows are an immutable data structure without a method like a case class's copy or
any kind of lens to create a brand new object, so I ended up stuck at the step
"target and extract the field to update" without any way to update the original
Row with the new value.
To sum up, I tried:
 * using only the Dataframe API itself + my udf - which works for nested structs
   as long as no arrays are along the way,
 * creating a udf that can apply on a Row and pattern match recursively the path
   I needed to explore/modify,
 * creating a UDT - but we seem to be stuck in a strange middle-ground with 2.0,
   because some parts of the API ended up private while some stayed public, making
   it impossible to use for now (I'd be glad if I'm mistaken).

All of these failed for me and I ended up converting the rows to JSON and updating
them using JSONPath, which is… something I'd like to avoid 'pretty please'





On Thu, Sep 15, 2016 5:20 AM, Michael Allman michael@videoamp.com
wrote:
Hi Guys,
Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a
public API, but it is publicly accessible. I used it recently to correct some
bad data in a few nested columns in a dataframe. It wasn't an easy job, but it
made it possible. In my particular case I was not working with arrays.
Olivier, I'm interested in seeing what you come up with.
Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss <fr...@gmail.com> wrote:
+1 to this request. I talked last week with a product group within IBM that is
struggling with the same issue. It's pretty common in data cleaning applications
for data in the early stages to have nested lists or sets with inconsistent or
incomplete schema information.
Fred
On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girardot@lateral-thoughts.com>  wrote:
Hi everyone,
I'm currently trying to create a generic transformation mechanism on a Dataframe
to modify an arbitrary column regardless of the underlying schema.
It's "relatively" straightforward for complex types like struct<struct<…>>: apply
an arbitrary UDF on the column and replace the data "inside" the struct. However,
I'm struggling to make it work for complex types containing arrays along the way,
like struct<array<struct<…>>>.
Michael Armbrust seemed to allude on the mailing list/forum to a way of using
Encoders to do that. I'd be interested in any pointers, especially considering
that it's not possible to output any Row or GenericRowWithSchema from a UDF
(thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).
To sum up, I'd like to find a way to apply a transformation on complex nested
datatypes (arrays and structs) on a Dataframe, updating the value itself.
Regards,
Olivier Girardot

 



Olivier Girardot | Associé
o.girardot@lateral-thoughts.com
+33 6 24 09 17 94

Re: Spark SQL - Applying transformation on a struct inside an array

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
Hi Michael,
Well, for nested structs, I saw in the tests the behaviour defined by SPARK-12512
for the "a.b.c" handling in withColumn, and even if it's not ideal for me, I
managed to make it work anyway like that:
> df.withColumn("a", struct(struct(myUDF(df("a.b.c"))))) // I didn't put back the aliases but you see what I mean
What I'd like to make work, in essence, is something like this:
> val someFunc: String => String = ???
> val myUDF = udf(someFunc)
> df.withColumn("a.b[*].c", myUDF(df("a.b[*].c")))
The fact is that, in order to be consistent with the previous API, maybe I'd have
to accept something like struct(array(struct(…, which would be troublesome because
I'd have to parse the arbitrary input string and create something like
"a.b[*].c" => struct(array(struct(…
I realise the ambiguity implied by this kind of column expression, but for now
there doesn't seem to be a clean way to update data in place at an arbitrary depth.
I'll try to work on a PR that would make this possible, but any pointers would
be appreciated.
Regards,
Olivier.
 





On Fri, Sep 16, 2016 12:42 AM, Michael Armbrust michael@databricks.com
wrote:
Is what you are looking for a withColumn that supports in-place modification of
nested columns? Or is it some other problem?
On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girardot@lateral-thoughts.com>  wrote:
I tried to use the RowEncoder but got stuck along the way.
The main issue really is that even if it's possible (however tedious) to pattern
match generically over Row(s) and target the nested field that you need to modify,
Rows are an immutable data structure without a method like a case class's copy or
any kind of lens to create a brand new object, so I ended up stuck at the step
"target and extract the field to update" without any way to update the original
Row with the new value.
To sum up, I tried:
 * using only the Dataframe API itself + my udf - which works for nested structs
   as long as no arrays are along the way,
 * creating a udf that can apply on a Row and pattern match recursively the path
   I needed to explore/modify,
 * creating a UDT - but we seem to be stuck in a strange middle-ground with 2.0,
   because some parts of the API ended up private while some stayed public, making
   it impossible to use for now (I'd be glad if I'm mistaken).

All of these failed for me and I ended up converting the rows to JSON and updating
them using JSONPath, which is… something I'd like to avoid 'pretty please'





On Thu, Sep 15, 2016 5:20 AM, Michael Allman michael@videoamp.com
wrote:
Hi Guys,
Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a
public API, but it is publicly accessible. I used it recently to correct some
bad data in a few nested columns in a dataframe. It wasn't an easy job, but it
made it possible. In my particular case I was not working with arrays.
Olivier, I'm interested in seeing what you come up with.
Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss <fr...@gmail.com> wrote:
+1 to this request. I talked last week with a product group within IBM that is
struggling with the same issue. It's pretty common in data cleaning applications
for data in the early stages to have nested lists or sets with inconsistent or
incomplete schema information.
Fred
On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girardot@lateral-thoughts.com>  wrote:
Hi everyone,
I'm currently trying to create a generic transformation mechanism on a Dataframe
to modify an arbitrary column regardless of the underlying schema.
It's "relatively" straightforward for complex types like struct<struct<…>>: apply
an arbitrary UDF on the column and replace the data "inside" the struct. However,
I'm struggling to make it work for complex types containing arrays along the way,
like struct<array<struct<…>>>.
Michael Armbrust seemed to allude on the mailing list/forum to a way of using
Encoders to do that. I'd be interested in any pointers, especially considering
that it's not possible to output any Row or GenericRowWithSchema from a UDF
(thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).
To sum up, I'd like to find a way to apply a transformation on complex nested
datatypes (arrays and structs) on a Dataframe, updating the value itself.
Regards,
Olivier Girardot

 



Olivier Girardot | Associé
o.girardot@lateral-thoughts.com
+33 6 24 09 17 94

Re: Spark SQL - Applying transformation on a struct inside an array

Posted by Michael Armbrust <mi...@databricks.com>.
Is what you are looking for a withColumn that supports in-place modification
of nested columns? Or is it some other problem?

On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> I tried to use the RowEncoder but got stuck along the way.
> The main issue really is that even if it's possible (however tedious) to
> pattern match generically over Row(s) and target the nested field that you need
> to modify, Rows are an immutable data structure without a method like a case
> class's copy or any kind of lens to create a brand new object, so I ended up
> stuck at the step "target and extract the field to update" without any way
> to update the original Row with the new value.
>
> To sum up, I tried :
>
>    - using only dataframe's API itself + my udf - which works for nested
>    structs as long as no arrays are along the way
>    - trying to create a udf that can apply on a Row and pattern match
>    recursively the path I needed to explore/modify
>    - trying to create a UDT - but we seem to be stuck in a strange
>    middle-ground with 2.0 because some parts of the API ended up private while
>    some stayed public making it impossible to use it now (I'd be glad if I'm
>    mistaken)
>
> All of these failed for me and I ended up converting the rows to JSON and
> updating them using JSONPath, which is… something I'd like to avoid 'pretty
> please'
>
>
>
> On Thu, Sep 15, 2016 5:20 AM, Michael Allman michael@videoamp.com wrote:
>
>> Hi Guys,
>>
>> Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's
>> not a public API, but it is publicly accessible. I used it recently to
>> correct some bad data in a few nested columns in a dataframe. It wasn't an
>> easy job, but it made it possible. In my particular case I was not working
>> with arrays.
>>
>> Olivier, I'm interested in seeing what you come up with.
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On Sep 14, 2016, at 10:44 AM, Fred Reiss <fr...@gmail.com> wrote:
>>
>> +1 to this request. I talked last week with a product group within IBM
>> that is struggling with the same issue. It's pretty common in data cleaning
>> applications for data in the early stages to have nested lists or sets with
>> inconsistent or incomplete schema information.
>>
>> Fred
>>
>> On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
>> o.girardot@lateral-thoughts.com> wrote:
>>
>> Hi everyone,
>> I'm currently trying to create a generic transformation mechanism on a
>> Dataframe to modify an arbitrary column regardless of the underlying
>> schema.
>>
>> It's "relatively" straightforward for complex types like
>> struct<struct<…>> to apply an arbitrary UDF on the column and replace the
>> data "inside" the struct, however I'm struggling to make it work for
>> complex types containing arrays along the way like struct<array<struct<…>>>.
>>
>> Michael Armbrust seemed to allude on the mailing list/forum to a way of
>> using Encoders to do that, I'd be interested in any pointers, especially
>> considering that it's not possible to output any Row or
>> GenericRowWithSchema from a UDF (thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).
>>
>> To sum up, I'd like to find a way to apply a transformation on complex
>> nested datatypes (arrays and struct) on a Dataframe updating the value
>> itself.
>>
>> Regards,
>>
>> *Olivier Girardot*
>>
>>
>>
>>
>
> *Olivier Girardot* | Associé
> o.girardot@lateral-thoughts.com
> +33 6 24 09 17 94
>

Re: Spark SQL - Applying transformation on a struct inside an array

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
I tried to use the RowEncoder but got stuck along the way.
The main issue really is that even if it's possible (however tedious) to pattern
match generically over Row(s) and target the nested field that you need to modify,
Rows are an immutable data structure without a method like a case class's copy or
any kind of lens to create a brand new object, so I ended up stuck at the step
"target and extract the field to update" without any way to update the original
Row with the new value.
To sum up, I tried:
 * using only the Dataframe API itself + my udf - which works for nested structs
   as long as no arrays are along the way,
 * creating a udf that can apply on a Row and pattern match recursively the path
   I needed to explore/modify,
 * creating a UDT - but we seem to be stuck in a strange middle-ground with 2.0,
   because some parts of the API ended up private while some stayed public, making
   it impossible to use for now (I'd be glad if I'm mistaken).

All of these failed for me and I ended up converting the rows to JSON and updating
them using JSONPath, which is… something I'd like to avoid 'pretty please'
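
For reference, a minimal sketch of what the RowEncoder route can look like on
the array example from this thread. This is an illustrative sketch under the
same toy schema; RowEncoder lives in catalyst and is not a public API, so it
may break across versions:

  import org.apache.spark.sql.{DataFrame, Row}
  import org.apache.spark.sql.catalyst.encoders.RowEncoder

  def transformViaRowEncoder(df: DataFrame): DataFrame = {
    val encoder = RowEncoder(df.schema) // encoder that keeps the schema unchanged
    df.map { row =>
      val as = row.getSeq[Row](row.fieldIndex("a")).map { inner =>
        Row(inner.getString(inner.fieldIndex("b")).toUpperCase)
      }
      Row(as)
    }(encoder)
  }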





On Thu, Sep 15, 2016 5:20 AM, Michael Allman michael@videoamp.com
wrote:
Hi Guys,
Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a
public API, but it is publicly accessible. I used it recently to correct some
bad data in a few nested columns in a dataframe. It wasn't an easy job, but it
made it possible. In my particular case I was not working with arrays.
Olivier, I'm interested in seeing what you come up with.
Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss <fr...@gmail.com> wrote:
+1 to this request. I talked last week with a product group within IBM that is
struggling with the same issue. It's pretty common in data cleaning applications
for data in the early stages to have nested lists or sets with inconsistent or
incomplete schema information.
Fred
On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girardot@lateral-thoughts.com>  wrote:
Hi everyone,
I'm currently trying to create a generic transformation mechanism on a Dataframe
to modify an arbitrary column regardless of the underlying schema.
It's "relatively" straightforward for complex types like struct<struct<…>>: apply
an arbitrary UDF on the column and replace the data "inside" the struct. However,
I'm struggling to make it work for complex types containing arrays along the way,
like struct<array<struct<…>>>.
Michael Armbrust seemed to allude on the mailing list/forum to a way of using
Encoders to do that. I'd be interested in any pointers, especially considering
that it's not possible to output any Row or GenericRowWithSchema from a UDF
(thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).
To sum up, I'd like to find a way to apply a transformation on complex nested
datatypes (arrays and structs) on a Dataframe, updating the value itself.
Regards,
Olivier Girardot

 



Olivier Girardot | Associé
o.girardot@lateral-thoughts.com
+33 6 24 09 17 94

Re: Spark SQL - Applying transformation on a struct inside an array

Posted by Fred Reiss <fr...@gmail.com>.
+1 to this request. I talked last week with a product group within IBM that
is struggling with the same issue. It's pretty common in data cleaning
applications for data in the early stages to have nested lists or sets with
inconsistent or incomplete schema information.

Fred

On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> Hi everyone,
> I'm currently trying to create a generic transformation mechanism on a
> Dataframe to modify an arbitrary column regardless of the underlying
> schema.
>
> It's "relatively" straightforward for complex types like struct<struct<…>>
> to apply an arbitrary UDF on the column and replace the data "inside" the
> struct, however I'm struggling to make it work for complex types containing
> arrays along the way like struct<array<struct<…>>>.
>
> Michael Armbrust seemed to allude on the mailing list/forum to a way of
> using Encoders to do that, I'd be interested in any pointers, especially
> considering that it's not possible to output any Row or
> GenericRowWithSchema from a UDF (thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).
>
> To sum up, I'd like to find a way to apply a transformation on complex
> nested datatypes (arrays and struct) on a Dataframe updating the value
> itself.
>
> Regards,
>
> *Olivier Girardot*
>
