Posted to user@spark.apache.org by James Starks <su...@protonmail.com.INVALID> on 2018/08/07 15:09:11 UTC

Newbie question on how to extract column value

I am very new to Spark. I have just successfully set up Spark SQL to connect to a PostgreSQL database, and I am able to display a table with the code

    sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show()

Now I want to perform filter and map functions on the col_b value. In plain Scala it would be something like

    Seq((1, "http://a.com/a"), (2, "http://b.com/b"), (3, "unknown"))
      .filter { case (_, url) => isValid(url) }
      .map { case (id, url) => (id, pathOf(url)) }

where filter removes invalid urls, and map turns each (id, url) into (id, path of url).
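For concreteness, here is a self-contained version of that snippet. Note that isValid and pathOf are never defined in this thread, so the implementations below (based on java.net.URI) are only one plausible reading:

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helpers -- the original post does not define isValid or pathOf.
def isValid(url: String): Boolean =
  Try(new URI(url)).toOption.exists(u => u.getScheme != null && u.getHost != null)

def pathOf(url: String): String = new URI(url).getPath

val result = Seq((1, "http://a.com/a"), (2, "http://b.com/b"), (3, "unknown"))
  .filter { case (_, url) => isValid(url) }
  .map { case (id, url) => (id, pathOf(url)) }
// result == Seq((1, "/a"), (2, "/b"))
```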

However, when I apply this idea to Spark SQL with the code snippet

    sparkSession.sql("...").filter(isValid($"url"))

The compiler complains about a type mismatch, because $"url" has type ColumnName rather than String. How can I extract the column value (e.g. http://...) from the url column so that I can apply my filter function?

Thanks

Java 1.8.0
Scala 2.11.8
Spark 2.1.0

Re: Newbie question on how to extract column value

Posted by James Starks <su...@protonmail.com.INVALID>.
Because of some legacy issues I can't immediately upgrade the Spark version. But, following the suggestion, I filter the data before loading it into Spark:

     val df = sparkSession.read.format("jdbc")
       .option(...)
       .option("dbtable", "(select .. from ... where url <> '') table_name")
       ...
       .load()
     df.createOrReplaceTempView("new_table")

Then performing the custom operation on that view does the trick:

    sparkSession.sql("select id, url from new_table").as[(String, String)].map { case (id, url) =>
      val derived_data = ... // operation on url
      (id, derived_data)
    }.show()

Thanks for the advice, it's really helpful!


Re: Newbie question on how to extract column value

Posted by Gourav Sengupta <go...@gmail.com>.
Hi James,

It is always advisable to use the latest Spark version. That said, can you
please give DataFrames and UDFs a try if possible? I think that would be a
much more scalable way to address the issue.

Also, where possible, it is always advisable to filter the data before
fetching it into Spark.


Thanks and Regards,
Gourav
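To sketch the UDF suggestion: the predicate itself can be plain Scala, and Spark wraps it into a Column-level function. The Spark-specific lines are shown as comments because they need a SparkSession and the spark-sql dependency, and isValid is still a hypothetical name from the original question:

```scala
import java.net.URI
import scala.util.Try

// Plain Scala predicate; hypothetical -- the thread never defines isValid.
val isValid: String => Boolean =
  url => Try(new URI(url)).toOption.exists(u => u.getScheme != null && u.getHost != null)

// To apply it to a DataFrame column (Spark 2.x), something along these lines:
//   import org.apache.spark.sql.functions.udf
//   val isValidUdf = udf(isValid)
//   df.filter(isValidUdf($"url")).show()
```

This avoids the ColumnName-vs-String mismatch from the original question: the UDF lifts the String => Boolean function so it can be called on a Column.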
