Posted to user@spark.apache.org by Bill Schwanitz <bi...@bilsch.org> on 2017/03/01 16:21:54 UTC
question on transforms for spark 2.0 dataset
Hi all,
I'm fairly new to Spark and Scala, so bear with me.
I'm working with a dataset containing a set of columns / fields. The data is
stored in HDFS as Parquet and is sourced from a Postgres box, so fields and
values are reasonably well formed. We are trying out a switch from Pentaho
and various SQL databases to pulling data into HDFS and applying transforms /
creating new datasets, with processing done in Spark (and other tools, still
under evaluation).
A rough version of the code I'm running so far:
val sample_data = spark.read.parquet("my_data_input")
val example_row = spark.sql("select * from parquet.my_data_input where id = 123").head
I want to apply a trim operation on a set of fields - let's call them
field1, field2, field3 and field4.
What is the best way to go about applying those trims and creating a new
dataset? Can I apply the trim to all fields in a single map, or do I need
to apply multiple map functions?
When I try the map (even with a single field):
scala> val transformed_data = sample_data.map(
| _.trim(col("field1"))
| .trim(col("field2"))
| .trim(col("field3"))
| .trim(col("field4"))
| )
I end up with the following error:
<console>:26: error: value trim is not a member of org.apache.spark.sql.Row
_.trim(col("field1"))
^
Any ideas / guidance would be appreciated!
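For context on the error: spark.read.parquet returns a DataFrame, i.e. a
Dataset[Row], so the `_` inside map is a Row, and Row has no trim method. A
single map over all four fields is possible once the rows are typed. The
following is only a sketch: the case class Record, its column types, and the
in-memory stand-in for the Parquet input are assumptions made for
illustration, and it assumes none of the four fields are null.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema; the real column types are not given in the thread.
case class Record(id: Int, field1: String, field2: String, field3: String, field4: String)

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("typed-map-sketch").getOrCreate()
import spark.implicits._

// Stand-in for spark.read.parquet("my_data_input").as[Record]
val sample_data = Seq(Record(123, " a ", "b  ", "  c", " d ")).toDS()

// One map over typed records: trim is called on each String field,
// not on the row itself. Assumes the fields are non-null.
val transformed_data = sample_data.map { r =>
  r.copy(
    field1 = r.field1.trim,
    field2 = r.field2.trim,
    field3 = r.field3.trim,
    field4 = r.field4.trim
  )
}

transformed_data.show()
```
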
Re: question on transforms for spark 2.0 dataset
Posted by Bill Schwanitz <bi...@bilsch.org>.
Subhash,
Yea that did the trick thanks!
On Wed, Mar 1, 2017 at 12:20 PM, Subhash Sriram <su...@gmail.com>
wrote:
> If I am understanding your problem correctly, I think you can just create
> a new DataFrame that is a transformation of sample_data by first
> registering sample_data as a temp table.
>
> //Register temp table
> sample_data.createOrReplaceTempView("sql_sample_data")
>
> //Create new DataSet with transformed values
> val transformed = spark.sql("select trim(field1) as field1, trim(field2) as field2...... from sql_sample_data")
>
> //Test
> transformed.show(10)
>
> I hope that helps!
> Subhash
Re: question on transforms for spark 2.0 dataset
Posted by Subhash Sriram <su...@gmail.com>.
If I am understanding your problem correctly, I think you can just create a
new DataFrame that is a transformation of sample_data by first registering
sample_data as a temp table.
//Register temp table
sample_data.createOrReplaceTempView("sql_sample_data")
//Create new DataSet with transformed values
val transformed = spark.sql("select trim(field1) as field1, trim(field2) as field2...... from sql_sample_data")
//Test
transformed.show(10)
I hope that helps!
Subhash
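The same transformation can probably also be written without a temp view,
using the built-in trim and col from org.apache.spark.sql.functions. This is
only a sketch: the helper name trimColumns and the in-memory stand-in for the
Parquet input are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("trim-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: replace each named column with its trimmed value.
// withColumn with an existing name overwrites that column in place,
// so all other columns and the column order are left untouched.
def trimColumns(df: DataFrame, names: Seq[String]): DataFrame =
  names.foldLeft(df)((acc, name) => acc.withColumn(name, trim(col(name))))

// Stand-in for spark.read.parquet("my_data_input")
val sample_data = Seq((123, " a ", "b  ", "  c", " d "))
  .toDF("id", "field1", "field2", "field3", "field4")

val transformed = trimColumns(sample_data, Seq("field1", "field2", "field3", "field4"))

transformed.show(10)
```

Folding over the column names keeps the list of fields in one place, which
may be easier to maintain than a long select string when more fields are
added later.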
On Wed, Mar 1, 2017 at 12:04 PM, Marco Mistroni <mm...@gmail.com> wrote:
> Hi, I think you need a UDF if you want to transform a column....
> Hth
Re: question on transforms for spark 2.0 dataset
Posted by Marco Mistroni <mm...@gmail.com>.
Hi, I think you need a UDF if you want to transform a column....
Hth
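A rough sketch of what the UDF route could look like is below. The name
trimUdf and the sample data are assumptions made for illustration. For plain
trimming the built-in trim function already covers this case, so a UDF is
mainly of interest when the transformation has no built-in equivalent.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical UDF wrapping String.trim; Option(...) keeps it null-safe,
// since string columns read from Parquet may contain nulls.
val trimUdf = udf((s: String) => Option(s).map(_.trim).orNull)

// Stand-in for spark.read.parquet("my_data_input")
val sample_data = Seq((123, " a ", "b  ")).toDF("id", "field1", "field2")

// Apply the UDF column by column; other columns are left as they are.
val transformed = sample_data
  .withColumn("field1", trimUdf(col("field1")))
  .withColumn("field2", trimUdf(col("field2")))

transformed.show()
```
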