Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/07/30 20:49:07 UTC

How to filter based on a constant value

Hi,

I would like to find out when I last paid a company by debit card.


This is the way I do it.

1) Find the date when I paid last
2) Find the rest of the details from the row(s)

So

var HASHTAG = "XYZ"
scala> var maxdate =
ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
maxdate: org.apache.spark.sql.Row = [2015-12-15]

OK so it was 2015-12-15


Now I want to get the rest of the columns. This one works when I hard code
the maxdate!


scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
&& col("transactiondate") === "2015-12-15").select("transactiondate",
"transactiondescription", "debitamount").show
+---------------+----------------------+-----------+
|transactiondate|transactiondescription|debitamount|
+---------------+----------------------+-----------+
|     2015-12-15|  XYZ LTD CD 4636 |      10.95|
+---------------+----------------------+-----------+

Now if I want to use the var maxdate in place of "2015-12-15", how would I
do that?

I tried lit(maxdate) etc., but they all give me an error:

java.lang.RuntimeException: Unsupported literal type class
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [2015-12-15]
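
As the replies below work out, the error comes from passing the whole Row
where a plain value is expected; lit() cannot turn a Row into a literal. A
minimal sketch of the unwrap, assuming the same ll_18740868 DataFrame and
HASHTAG variable as above:

import org.apache.spark.sql.functions.{col, max}

// collect returns an Array[Row]; take the single Row, then the Date inside
// it, and compare against that plain value instead of the Row itself.
val maxdate = ll_18740868
  .filter(col("transactiondescription").contains(HASHTAG))
  .agg(max("transactiondate"))
  .collect()(0)
  .getDate(0)

ll_18740868
  .filter(col("transactiondescription").contains(HASHTAG) &&
          col("transactiondate") === maxdate)
  .select("transactiondate", "transactiondescription", "debitamount")
  .show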


Thanks

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks all

scala> var maxdate =
ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0).getDate(0)
maxdate: java.sql.Date = 2015-12-15
scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
&& col("transactiondate") === maxdate).select("transactiondate",
"transactiondescription", "debitamount").show
+---------------+----------------------+-----------+
|transactiondate|transactiondescription|debitamount|
+---------------+----------------------+-----------+
|     2015-12-15|  XYZ LTD CD 4636 |      10.95|
+---------------+----------------------+-----------+
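
For reference, the window-function approach suggested later in this thread
can also be written directly with the DataFrame API rather than SQL. A
sketch, assuming Spark 1.6; the unpartitioned window pulls everything into a
single partition, which is fine for a result this small:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Rank the matching rows by transactiondate descending and keep the latest
// one, so the maximum date and its other columns come back in a single pass.
val w = Window.orderBy(col("transactiondate").desc)

ll_18740868
  .filter(col("transactiondescription").contains(HASHTAG))
  .withColumn("r", rank().over(w))
  .filter(col("r") === 1)
  .select("transactiondate", "transactiondescription", "debitamount")
  .show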



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 08:27, Mich Talebzadeh <mi...@gmail.com> wrote:

> thanks Nicholas got it
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 07:49, Nicholas Hakobian <
> nicholas.hakobian@rallyhealth.com> wrote:
>
>> From the online docs:
>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/Row.html#apply(int)
>>
>> java.lang.Object apply(int i)
>> Returns the value at position i. If the value is null, null is returned.
>> The following is a mapping between Spark SQL types and return types:
>>
>> So its returning the content of the first element in the row, in this
>> case the Array (of length 1) of Date types.
>>
>>
>> Nicholas Szandor Hakobian
>> Data Scientist
>> Rally Health
>> nicholas.hakobian@rallyhealth.com
>> M: 510-295-7113
>>
>>
>> On Sat, Jul 30, 2016 at 11:41 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> thanks gents.
>>>
>>> I am trying to understand this better.
>>>
>>> As I understand a DataFrame is basically an equivalent table in
>>> relational term.
>>>
>>> so
>>>
>>> scala> var maxdate =
>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate"))
>>> maxdate: org.apache.spark.sql.DataFrame = [max(transactiondate): date]
>>>
>>> So I find the max(transactiondate) for the filter I have applied.  In
>>> sql term --> select max(transactiondate) from ll_18740868 where
>>> transactiondescription like "%HASHTAG%"
>>>
>>> Now I want to store it in a single variable and get it worked out
>>>
>>> scala> var maxdate =
>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect
>>> maxdate: Array[org.apache.spark.sql.Row] = Array([2015-12-15])
>>>
>>> Now I have the value stored in a row. I get it as follows. It is the
>>> first column of the row (actually the only column) and in date format
>>>
>>> scala> var maxdate =
>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.
>>> apply(0).getDate(0)
>>> maxdate: java.sql.Date = 2015-12-15
>>>
>>> what is the role of apply(0) here?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 31 July 2016 at 03:28, Xinh Huynh <xi...@gmail.com> wrote:
>>>
>>>> Hi Mitch,
>>>>
>>>> I think you were missing a step:
>>>> [your result] maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>> Since maxdate is of type Row, you would want to extract the first
>>>> column of the Row with:
>>>>
>>>> >> val maxdateStr = maxdate.getString(0)
>>>>
>>>> assuming the column type is String.
>>>> API doc is here:
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
>>>>
>>>> Then you can do the query:
>>>>
>>>> >> col("transactiondate") === maxdateStr
>>>>
>>>> Xinh
>>>>
>>>> On Sat, Jul 30, 2016 at 5:20 PM, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> select *
>>>>> from (select *,
>>>>>              rank() over (order by transactiondate) r
>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>       ) inner
>>>>> where r=1
>>>>>
>>>>> Hi Mitch,
>>>>>
>>>>> If using SQL is fine, you can try the code above. You need to register
>>>>> ll_18740868  as temp table.
>>>>>
>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I would like to find out when it was the last time I paid a company
>>>>>> with Debit Card
>>>>>>
>>>>>>
>>>>>> This is the way I do it.
>>>>>>
>>>>>> 1) Find the date when I paid last
>>>>>> 2) Find the rest of details from the row(s)
>>>>>>
>>>>>> So
>>>>>>
>>>>>> var HASHTAG = "XYZ"
>>>>>> scala> var maxdate =
>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>
>>>>>> OK so it was 2015-12-15
>>>>>>
>>>>>>
>>>>>> Now I want to get the rest of the columns. This one works when I hard
>>>>>> code the maxdate!
>>>>>>
>>>>>>
>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>> "transactiondescription", "debitamount").show
>>>>>> +---------------+----------------------+-----------+
>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>> +---------------+----------------------+-----------+
>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>> +---------------+----------------------+-----------+
>>>>>>
>>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>>> would I do that?
>>>>>>
>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>
>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>> [2015-12-15]
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>
>

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Nicholas, got it.



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 07:49, Nicholas Hakobian <
nicholas.hakobian@rallyhealth.com> wrote:

> From the online docs:
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/Row.html#apply(int)
>
> java.lang.Object apply(int i)
> Returns the value at position i. If the value is null, null is returned.
> The following is a mapping between Spark SQL types and return types:
>
> So its returning the content of the first element in the row, in this case
> the Array (of length 1) of Date types.
>
>
> Nicholas Szandor Hakobian
> Data Scientist
> Rally Health
> nicholas.hakobian@rallyhealth.com
> M: 510-295-7113
>
>
> On Sat, Jul 30, 2016 at 11:41 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> thanks gents.
>>
>> I am trying to understand this better.
>>
>> As I understand a DataFrame is basically an equivalent table in
>> relational term.
>>
>> so
>>
>> scala> var maxdate =
>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate"))
>> maxdate: org.apache.spark.sql.DataFrame = [max(transactiondate): date]
>>
>> So I find the max(transactiondate) for the filter I have applied.  In sql
>> term --> select max(transactiondate) from ll_18740868 where
>> transactiondescription like "%HASHTAG%"
>>
>> Now I want to store it in a single variable and get it worked out
>>
>> scala> var maxdate =
>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect
>> maxdate: Array[org.apache.spark.sql.Row] = Array([2015-12-15])
>>
>> Now I have the value stored in a row. I get it as follows. It is the
>> first column of the row (actually the only column) and in date format
>>
>> scala> var maxdate =
>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.
>> apply(0).getDate(0)
>> maxdate: java.sql.Date = 2015-12-15
>>
>> what is the role of apply(0) here?
>>
>> Thanks
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 July 2016 at 03:28, Xinh Huynh <xi...@gmail.com> wrote:
>>
>>> Hi Mitch,
>>>
>>> I think you were missing a step:
>>> [your result] maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>> Since maxdate is of type Row, you would want to extract the first column
>>> of the Row with:
>>>
>>> >> val maxdateStr = maxdate.getString(0)
>>>
>>> assuming the column type is String.
>>> API doc is here:
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
>>>
>>> Then you can do the query:
>>>
>>> >> col("transactiondate") === maxdateStr
>>>
>>> Xinh
>>>
>>> On Sat, Jul 30, 2016 at 5:20 PM, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> select *
>>>> from (select *,
>>>>              rank() over (order by transactiondate) r
>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>       ) inner
>>>> where r=1
>>>>
>>>> Hi Mitch,
>>>>
>>>> If using SQL is fine, you can try the code above. You need to register
>>>> ll_18740868  as temp table.
>>>>
>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I would like to find out when it was the last time I paid a company
>>>>> with Debit Card
>>>>>
>>>>>
>>>>> This is the way I do it.
>>>>>
>>>>> 1) Find the date when I paid last
>>>>> 2) Find the rest of details from the row(s)
>>>>>
>>>>> So
>>>>>
>>>>> var HASHTAG = "XYZ"
>>>>> scala> var maxdate =
>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>
>>>>> OK so it was 2015-12-15
>>>>>
>>>>>
>>>>> Now I want to get the rest of the columns. This one works when I hard
>>>>> code the maxdate!
>>>>>
>>>>>
>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>> "transactiondescription", "debitamount").show
>>>>> +---------------+----------------------+-----------+
>>>>> |transactiondate|transactiondescription|debitamount|
>>>>> +---------------+----------------------+-----------+
>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>> +---------------+----------------------+-----------+
>>>>>
>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>> would I do that?
>>>>>
>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>
>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>> [2015-12-15]
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>

Re: How to filter based on a constant value

Posted by Nicholas Hakobian <ni...@rallyhealth.com>.
From the online docs:
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/Row.html#apply(int)

java.lang.Object apply(int i)
Returns the value at position i. If the value is null, null is returned.
The following is a mapping between Spark SQL types and return types:

So it's returning the content of the first element in the row, in this case
the Array (of length 1) of Date types.
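
Spelled out step by step against the thread's DataFrame (a sketch that
splits the chained call so each piece is visible):

import org.apache.spark.sql.functions.{col, max}

val rows = ll_18740868
  .filter(col("transactiondescription").contains(HASHTAG))
  .agg(max("transactiondate"))
  .collect                       // Array[Row] with a single element
val row     = rows.apply(0)      // same as rows(0): the one Row in that Array
val maxdate = row.getDate(0)     // the java.sql.Date in the Row's only column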


Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com
M: 510-295-7113


On Sat, Jul 30, 2016 at 11:41 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> wrote:

> thanks gents.
>
> I am trying to understand this better.
>
> As I understand a DataFrame is basically an equivalent table in relational
> term.
>
> so
>
> scala> var maxdate =
> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate"))
> maxdate: org.apache.spark.sql.DataFrame = [max(transactiondate): date]
>
> So I find the max(transactiondate) for the filter I have applied.  In sql
> term --> select max(transactiondate) from ll_18740868 where
> transactiondescription like "%HASHTAG%"
>
> Now I want to store it in a single variable and get it worked out
>
> scala> var maxdate =
> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect
> maxdate: Array[org.apache.spark.sql.Row] = Array([2015-12-15])
>
> Now I have the value stored in a row. I get it as follows. It is the
> first column of the row (actually the only column) and in date format
>
> scala> var maxdate =
> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.
> apply(0).getDate(0)
> maxdate: java.sql.Date = 2015-12-15
>
> what is the role of apply(0) here?
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 03:28, Xinh Huynh <xi...@gmail.com> wrote:
>
>> Hi Mitch,
>>
>> I think you were missing a step:
>> [your result] maxdate: org.apache.spark.sql.Row = [2015-12-15]
>> Since maxdate is of type Row, you would want to extract the first column
>> of the Row with:
>>
>> >> val maxdateStr = maxdate.getString(0)
>>
>> assuming the column type is String.
>> API doc is here:
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
>>
>> Then you can do the query:
>>
>> >> col("transactiondate") === maxdateStr
>>
>> Xinh
>>
>> On Sat, Jul 30, 2016 at 5:20 PM, ayan guha <gu...@gmail.com> wrote:
>>
>>> select *
>>> from (select *,
>>>              rank() over (order by transactiondate) r
>>>        from ll_18740868 where transactiondescription='XYZ'
>>>       ) inner
>>> where r=1
>>>
>>> Hi Mitch,
>>>
>>> If using SQL is fine, you can try the code above. You need to register
>>> ll_18740868  as temp table.
>>>
>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I would like to find out when it was the last time I paid a company
>>>> with Debit Card
>>>>
>>>>
>>>> This is the way I do it.
>>>>
>>>> 1) Find the date when I paid last
>>>> 2) Find the rest of details from the row(s)
>>>>
>>>> So
>>>>
>>>> var HASHTAG = "XYZ"
>>>> scala> var maxdate =
>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>
>>>> OK so it was 2015-12-15
>>>>
>>>>
>>>> Now I want to get the rest of the columns. This one works when I hard
>>>> code the maxdate!
>>>>
>>>>
>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>> "transactiondescription", "debitamount").show
>>>> +---------------+----------------------+-----------+
>>>> |transactiondate|transactiondescription|debitamount|
>>>> +---------------+----------------------+-----------+
>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>> +---------------+----------------------+-----------+
>>>>
>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>> would I do that?
>>>>
>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>
>>>> java.lang.RuntimeException: Unsupported literal type class
>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>> [2015-12-15]
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
thanks gents.

I am trying to understand this better.

As I understand it, a DataFrame is basically the equivalent of a table in
relational terms.

so

scala> var maxdate =
ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate"))
maxdate: org.apache.spark.sql.DataFrame = [max(transactiondate): date]

So I find the max(transactiondate) for the filter I have applied. In SQL
terms --> select max(transactiondate) from ll_18740868 where
transactiondescription like '%HASHTAG%'
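
For reference, the SQL-string route suggested elsewhere in the thread looks
like this in the shell. A sketch: it assumes the spark-shell's sqlContext,
that the DataFrame has been registered as a temporary table, and that Scala
string interpolation stands in for bind variables:

// Register the DataFrame so it can be queried by name.
ll_18740868.registerTempTable("ll_18740868")

// Build the query text around the HASHTAG value, since Spark SQL itself has
// no bind variables.
sqlContext.sql(
  s"select max(transactiondate) from ll_18740868 " +
  s"where transactiondescription like '%$HASHTAG%'").show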

Now I want to store it in a single variable so I can work with it.

scala> var maxdate =
ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect
maxdate: Array[org.apache.spark.sql.Row] = Array([2015-12-15])

Now I have the value stored in a Row. I get it as follows: it is the first
column of the Row (actually the only column), in date format.

scala> var maxdate =
ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.
apply(0).getDate(0)
maxdate: java.sql.Date = 2015-12-15

What is the role of apply(0) here?

Thanks





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 03:28, Xinh Huynh <xi...@gmail.com> wrote:

> Hi Mitch,
>
> I think you were missing a step:
> [your result] maxdate: org.apache.spark.sql.Row = [2015-12-15]
> Since maxdate is of type Row, you would want to extract the first column
> of the Row with:
>
> >> val maxdateStr = maxdate.getString(0)
>
> assuming the column type is String.
> API doc is here:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
>
> Then you can do the query:
>
> >> col("transactiondate") === maxdateStr
>
> Xinh
>
> On Sat, Jul 30, 2016 at 5:20 PM, ayan guha <gu...@gmail.com> wrote:
>
>> select *
>> from (select *,
>>              rank() over (order by transactiondate) r
>>        from ll_18740868 where transactiondescription='XYZ'
>>       ) inner
>> where r=1
>>
>> Hi Mitch,
>>
>> If using SQL is fine, you can try the code above. You need to register
>> ll_18740868  as temp table.
>>
>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> I would like to find out when it was the last time I paid a company with
>>> Debit Card
>>>
>>>
>>> This is the way I do it.
>>>
>>> 1) Find the date when I paid last
>>> 2) Find the rest of details from the row(s)
>>>
>>> So
>>>
>>> var HASHTAG = "XYZ"
>>> scala> var maxdate =
>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>
>>> OK so it was 2015-12-15
>>>
>>>
>>> Now I want to get the rest of the columns. This one works when I hard
>>> code the maxdate!
>>>
>>>
>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>> "transactiondescription", "debitamount").show
>>> +---------------+----------------------+-----------+
>>> |transactiondate|transactiondescription|debitamount|
>>> +---------------+----------------------+-----------+
>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>> +---------------+----------------------+-----------+
>>>
>>> Now if I want to use the var maxdate in place of "2015-12-15", how would
>>> I do that?
>>>
>>> I tried lit(maxdate) etc but they are all giving me error?
>>>
>>> java.lang.RuntimeException: Unsupported literal type class
>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>> [2015-12-15]
>>>
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>

Re: How to filter based on a constant value

Posted by Xinh Huynh <xi...@gmail.com>.
Hi Mitch,

I think you were missing a step:
[your result] maxdate: org.apache.spark.sql.Row = [2015-12-15]
Since maxdate is of type Row, you would want to extract the first column of
the Row with:

>> val maxdateStr = maxdate.getString(0)

assuming the column type is String.
API doc is here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row

Then you can do the query:

>> col("transactiondate") === maxdateStr

Xinh

On Sat, Jul 30, 2016 at 5:20 PM, ayan guha <gu...@gmail.com> wrote:

> select *
> from (select *,
>              rank() over (order by transactiondate) r
>        from ll_18740868 where transactiondescription='XYZ'
>       ) inner
> where r=1
>
> Hi Mitch,
>
> If using SQL is fine, you can try the code above. You need to register
> ll_18740868  as temp table.
>
> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I would like to find out when it was the last time I paid a company with
>> Debit Card
>>
>>
>> This is the way I do it.
>>
>> 1) Find the date when I paid last
>> 2) Find the rest of details from the row(s)
>>
>> So
>>
>> var HASHTAG = "XYZ"
>> scala> var maxdate =
>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>
>> OK so it was 2015-12-15
>>
>>
>> Now I want to get the rest of the columns. This one works when I hard
>> code the maxdate!
>>
>>
>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>> "transactiondescription", "debitamount").show
>> +---------------+----------------------+-----------+
>> |transactiondate|transactiondescription|debitamount|
>> +---------------+----------------------+-----------+
>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>> +---------------+----------------------+-----------+
>>
>> Now if I want to use the var maxdate in place of "2015-12-15", how would
>> I do that?
>>
>> I tried lit(maxdate) etc but they are all giving me error?
>>
>> java.lang.RuntimeException: Unsupported literal type class
>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>> [2015-12-15]
>>
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: How to filter based on a constant value

Posted by ayan guha <gu...@gmail.com>.
Hi

Interesting problem :) And this is where my knowledge is limited. But as I
understand it, this is a clustering problem over names. You may want to find
the bunches of names that belong to the same group by computing, say, the
distance between them. Spark supports a few clustering algorithms under
MLlib.

Would love to know other opinions about getting this done.

After all, we are all asking the same question, aren't we :)
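
MLlib's clusterers (e.g. KMeans) would need the descriptions featurised
first; as a simpler hedged starting point for the "distance between them"
idea, Spark SQL's built-in levenshtein function can flag near-duplicate
descriptions. A sketch, assuming the ll_18740868 DataFrame from this thread:

import org.apache.spark.sql.functions.{col, levenshtein}

// Pair up the distinct descriptions and keep the pairs whose edit distance
// is small: likely the same payee written slightly differently.
val names = ll_18740868.select(col("transactiondescription").alias("a")).distinct
val pairs = names
  .join(names.toDF("b"))                 // Cartesian pairing of a small distinct set
  .filter(col("a") < col("b"))           // keep each unordered pair once
  .withColumn("dist", levenshtein(col("a"), col("b")))

pairs.filter(col("dist") <= 3).show(100, false)
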
On 1 Aug 2016 00:36, "Mich Talebzadeh" <mi...@gmail.com> wrote:

> Many thanks
>
> This is a challenge of some sort.
>
> I did this for my own work.
>
> I downloaded my bank account for the past few years as a CSV format and
> loaded into Hive ORC table with databricks stuff.
>
> All tables are transactional and bucketed.
>
> A typical row looks like this (they vary from bank to bank)
>
> hive> desc ll_18740868;
> transactiondate         date
> transactiontype         string
> sortcode                string
> accountnumber           string
> transactiondescription  string
> debitamount             double
> creditamount            double
> balance                 double
>
> The columns I am interested are
>
> transactiondate         date
> transactiontype         string
> transactiondescription  string
> debitamount             double
>
> Basically paying by Debit Card means I need to look at transactiontype =
> 'DEB'. Actually I download all their codes (those three letters
> abbreviation like
>
> 1       BGC     Bank Giro Credit
> 2       BNS     Bonus
> 12      DEB     Debit Card
>
> etc
>
> My biggest challenge has been to decode transactiondescription column as
> for each payee it uses different description. For example
>
> SELECT DISTINCT transactiondescription FROM  ll_18740868 ORDER BY
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2);
>
> SAINSBURY'S S/MKT CD 5710
> SAINSBURY'S S/MKT CD 4610
> SAINSBURY'S S/MKT CD 5916
> SAINSBURY'S S/MKT CD 4628
> SAINSBURY'S S/MKT CD 4636
> SAINSBURY'S SMKT CD 5710
> SAINSBURY'S SMKT CD 4628
> SAINSBURYS CD 5916
> SAINSBURYS CD 5710
> SAINSBURYS S/MKT CD 5710
> SAINSBURYS S/MKTS CD 5916
> SAINSBURYS S/MKTS CD 4628
> SAINSBURYS S/MKTS CD 4636
> SAINSBURYS SMKT CD 5710
> SALISBURY CATHEDRA CD 5710
>
> If I look at the description I can see that they all belong to a grocer so
> if I sun them up as below I will know how much I spent in total for each
> payee sort of) by taking those substring up to -2 characters before CD
>
> CREATE TEMPORARY TABLE tmp (hashtag String, Spent float);
> INSERT INTO tmp
> SELECT DISTINCT *
> FROM (SELECT
> SUBSTRING(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
> SUM(debitamount) OVER (PARTITION BY
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2)) r
> FROM accounts.ll_18740868 WHERE transactiontype = 'DEB'
> ) RS
> ;
> SELECT * FROM tmp order by spent,hashtag;
>
>
> This is crude. What I really want is to dig in into all
> transactiondescriptions which belong to the same payee under full text.
> Like above they are all Sainsbury  whatever they describe :)
>
> It is a type of sentiment analysis for myself so I know where I am wasting
> my money most :)
>
> Cheers
>
>
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 13:37, ayan guha <gu...@gmail.com> wrote:
>
>> Hi
>>
>> here is a quick setup (Based on airlines.txt dataset):
>>
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>> *from datetime import datetime, timedelta*
>> *from pyspark.sql.types import **
>> *from pyspark.sql.functions import udf, col,rank,min*
>> *from pyspark import SparkContext, HiveContext*
>> *import sys*
>> *from pyspark.sql import Window*
>>
>>
>> *sc = SparkContext()*
>> *hc = HiveContext(sc)*
>> *customSchema = StructType([ \*
>> *StructField("airport_id", IntegerType(), True) , \*
>> *StructField("name", StringType(), True) , \*
>> *StructField("city", StringType(), True) , \*
>> *StructField("country", StringType(), True) , \*
>> *StructField("iata", StringType(), True) , \*
>> *StructField("icao", StringType(), True) , \*
>> *StructField("latitude", DecimalType(precision=20,scale=10), True) , \*
>> *StructField("longitude",DecimalType(precision=20,scale=10), True) , \*
>> *StructField("altitude", IntegerType(), True) , \*
>> *StructField("timezone", DoubleType(), True) , \*
>> *StructField("dst", StringType(), True) , \*
>> *StructField("tz_name", StringType(), True)*
>> *])*
>>
>> *inFile = sys.argv[1]*
>>
>> *df1 = df =
>> hc.read.format('com.databricks.spark.csv').options(header='false',
>> inferschema='true').load(inFile,schema=customSchema)*
>>
>>
>> *df1.registerTempTable("airlines")*
>> *df2 = hc.sql("select airport_id,altitude,r from (select *,rank() over
>> (order by altitude desc) r from airlines where altitude>100) rs where r=1")*
>> *print df2.take(10)*
>>
>> *w = Window.orderBy(df['altitude'].desc())*
>>
>> *df3 = df1.filter(df1.altitude >
>> 100).select(df1.airport_id,df1.altitude,rank().over(w).alias("r")).filter("r=1")*
>> *print df3.take(10)*
>>
>> *sc.stop()*
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Here
>> <https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html>is
>> an awesome blog from Databricks.
>>
>> HTH....
>>
>> Ayan
>>
>>
>>
>> On Sun, Jul 31, 2016 at 8:58 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> It is true that whatever an analytic function does can be done
>>> by standard SQL, with join and sub-queries. But the same routine done by
>>> analytic function is always faster, or at least as fast, when compared to
>>> standard SQL.
>>>
>>> I will try to see if I can do analytic functions with Spark FP on Data
>>> Frames. It is essentially replacing the base table with DF and using JAVA
>>> functions instead of SQL ones on top
>>>
>>> Also some text based search functions say LIKE in SQL can be replaced
>>> with CONTAINS in FP.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 31 July 2016 at 10:56, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> The point is, window functions are supposed designed to be faster by
>>>> doing the calculations in one pass, instead of 2 pass in case of max.
>>>>
>>>> DF supports window functions (using sql.Window) so instead of writing
>>>> sql, you can use it as well.
>>>>
>>>> Best
>>>> Ayan
>>>>
>>>> On Sun, Jul 31, 2016 at 7:48 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> yes reserved word issue thanks
>>>>>
>>>>> hive> select *
>>>>>     > from (select transactiondate, transactiondescription, debitamount
>>>>>     > , rank() over (order by transactiondate desc) r
>>>>>     > from accounts.ll_18740868 where transactiondescription like
>>>>> '%HARRODS%'
>>>>>     >  ) RS
>>>>>     > where r=1
>>>>>     > ;
>>>>> Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
>>>>> Total jobs = 1
>>>>> Launching Job 1 out of 1
>>>>> In order to change the average load for a reducer (in bytes):
>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> In order to limit the maximum number of reducers:
>>>>>   set hive.exec.reducers.max=<number>
>>>>> In order to set a constant number of reducers:
>>>>>   set mapreduce.job.reduces=<number>
>>>>> Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
>>>>> Query Hive on Spark job[0] stages:
>>>>> 0
>>>>> 1
>>>>> Status: Running (Hive on Spark job[0])
>>>>> Job Progress Format
>>>>> CurrentTime StageId_StageAttemptId:
>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>>>> [StageCost]
>>>>> 2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
>>>>> 2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
>>>>> 2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
>>>>> 2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
>>>>> 2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
>>>>> Status: Finished successfully in 10.10 seconds
>>>>> OK
>>>>> 2015-12-15      HARRODS LTD CD 4636     10.95   1
>>>>> Time taken: 46.546 seconds, Fetched: 1 row(s)
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 31 July 2016 at 10:36, ayan guha <gu...@gmail.com> wrote:
>>>>>
>>>>>> I think the word "INNER" is reserved in Hive. Please change the alias
>>>>>> to something else.
>>>>>>
>>>>>> Not sure about scala, but essentially it is string replacement.
>>>>>>
>>>>>> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>> thanks how about scala?
>>>>>>>
>>>>>>> BTW the same analytic code fails in Hive itself:(
>>>>>>>
>>>>>>> hive> select *
>>>>>>>     > from (select transactiondate, transactiondescription,
>>>>>>> debitamount
>>>>>>>     > from (select transactiondate, transactiondescription,
>>>>>>> debitamount
>>>>>>>     > , rank() over (order by transactiondate desc) r
>>>>>>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>     >      ) inner
>>>>>>>     > where r=1
>>>>>>>     > ;
>>>>>>>
>>>>>>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>>>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>>>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>>>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>>>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>> Method)
>>>>>>>         at
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>         at
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>>>>> *FAILED: ParseException line 6:7 Failed to recognize predicate
>>>>>>> 'inner'. Failed rule: 'identifier' in subquery source*
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> This is because Spark does  not provide a way to "bind" variables
>>>>>>>> like Oracle does.
>>>>>>>>
>>>>>>>> So you can build the sql string, like below (in python)
>>>>>>>>
>>>>>>>> val = 'XYZ'
>>>>>>>> sqlbase = "select ..... where col = '<val>'".replace('<val>,val)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ayan.
>>>>>>>>>
>>>>>>>>> This is the one I used
>>>>>>>>>
>>>>>>>>> scala> sqltext = """
>>>>>>>>>      |  select *
>>>>>>>>>      | from (select transactiondate, transactiondescription,
>>>>>>>>> debitamount
>>>>>>>>>      | , rank() over (order by transactiondate desc) r
>>>>>>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>>>      |       ) inner
>>>>>>>>>      |  where r=1
>>>>>>>>>      |    """
>>>>>>>>>
>>>>>>>>> scala> HiveContext.sql(sqltext).show
>>>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>>>
>>>>>>>>> The issue I see is that in SQL here I cannot pass HASHTAG as a
>>>>>>>>> variable to SQL. For example in RDBMS I can do this
>>>>>>>>>
>>>>>>>>> 1> declare @pattern varchar(50)
>>>>>>>>> 2> set @pattern = 'Direct'
>>>>>>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>>>>>>> '%'||@pattern||'%'
>>>>>>>>> 4> go
>>>>>>>>> (1 row affected)
>>>>>>>>>  CHANNEL_DESC
>>>>>>>>>  --------------------
>>>>>>>>>  Direct Sales
>>>>>>>>>
>>>>>>>>> but not in Hive or Spark SQL
>>>>>>>>>
>>>>>>>>> whereas with FP it does it implicitly.
>>>>>>>>>
>>>>>>>>> col("CHANNELS").contains(HASHTAG))
>>>>>>>>>
>>>>>>>>> Unless there is a way of doing it?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> select *
>>>>>>>>>> from (select *,
>>>>>>>>>>              rank() over (order by transactiondate) r
>>>>>>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>>>>>>       ) inner
>>>>>>>>>> where r=1
>>>>>>>>>>
>>>>>>>>>> Hi Mitch,
>>>>>>>>>>
>>>>>>>>>> If using SQL is fine, you can try the code above. You need to
>>>>>>>>>> register ll_18740868  as temp table.
>>>>>>>>>>
>>>>>>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I would like to find out when it was the last time I paid a
>>>>>>>>>>> company with Debit Card
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is the way I do it.
>>>>>>>>>>>
>>>>>>>>>>> 1) Find the date when I paid last
>>>>>>>>>>> 2) Find the rest of details from the row(s)
>>>>>>>>>>>
>>>>>>>>>>> So
>>>>>>>>>>>
>>>>>>>>>>> var HASHTAG = "XYZ"
>>>>>>>>>>> scala> var maxdate =
>>>>>>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>>>>>>
>>>>>>>>>>> OK so it was 2015-12-15
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Now I want to get the rest of the columns. This one works when I
>>>>>>>>>>> hard code the maxdate!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>>>>>>> "transactiondescription", "debitamount").show
>>>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>>>
>>>>>>>>>>> Now if I want to use the var maxdate in place of "2015-12-15",
>>>>>>>>>>> how would I do that?
>>>>>>>>>>>
>>>>>>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>>>>>>
>>>>>>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>>>>>>> [2015-12-15]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best Regards,
>>>>>>>>>> Ayan Guha
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
Many thanks

This is a challenge of some sort.

I did this for my own work.

I downloaded my bank account statements for the past few years in CSV format
and loaded them into a Hive ORC table with the Databricks CSV package.

All tables are transactional and bucketed.

A typical table layout looks like this (they vary from bank to bank):

hive> desc ll_18740868;
transactiondate         date
transactiontype         string
sortcode                string
accountnumber           string
transactiondescription  string
debitamount             double
creditamount            double
balance                 double

The columns I am interested in are

transactiondate         date
transactiontype         string
transactiondescription  string
debitamount             double

Basically, paying by debit card means I need to look at transactiontype =
'DEB'. Actually I downloaded all their codes (those three-letter
abbreviations), like:

1       BGC     Bank Giro Credit
2       BNS     Bonus
12      DEB     Debit Card

etc

My biggest challenge has been to decode the transactiondescription column,
as the same payee can appear under several different descriptions. For
example:

SELECT DISTINCT transactiondescription FROM  ll_18740868 ORDER BY
substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2);

SAINSBURY'S S/MKT CD 5710
SAINSBURY'S S/MKT CD 4610
SAINSBURY'S S/MKT CD 5916
SAINSBURY'S S/MKT CD 4628
SAINSBURY'S S/MKT CD 4636
SAINSBURY'S SMKT CD 5710
SAINSBURY'S SMKT CD 4628
SAINSBURYS CD 5916
SAINSBURYS CD 5710
SAINSBURYS S/MKT CD 5710
SAINSBURYS S/MKTS CD 5916
SAINSBURYS S/MKTS CD 4628
SAINSBURYS S/MKTS CD 4636
SAINSBURYS SMKT CD 5710
SALISBURY CATHEDRA CD 5710

If I look at the descriptions I can see that they all belong to the same
grocer, so if I sum them up as below I will know (sort of) how much I spent
in total for each payee, by taking the substring up to two characters before
'CD':

CREATE TEMPORARY TABLE tmp (hashtag String, Spent float);
INSERT INTO tmp
SELECT DISTINCT *
FROM (SELECT
SUBSTRING(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
SUM(debitamount) OVER (PARTITION BY
substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2)) r
FROM accounts.ll_18740868 WHERE transactiontype = 'DEB'
) RS
;
SELECT * FROM tmp order by spent,hashtag;


This is crude. What I really want is to dig into all transactiondescriptions
that belong to the same payee using the full text. Like the ones above: they
are all Sainsbury's, whatever they describe :)
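
A DataFrame sketch of the same idea (a hypothetical normalisation: strip the
trailing "CD nnnn" and every non-letter to form a crude payee key; it merges
the punctuation variants but, as noted above, still leaves e.g. SAINSBURYS
and SAINSBURYS S/MKT in separate buckets):

import org.apache.spark.sql.functions.{col, regexp_replace, sum}

val spentByPayee = ll_18740868
  .filter(col("transactiontype") === "DEB")
  .withColumn("payee",
    regexp_replace(
      regexp_replace(col("transactiondescription"), "\\s*CD\\s+\\d+$", ""),
      "[^A-Za-z]", ""))
  .groupBy("payee")
  .agg(sum("debitamount").alias("spent"))
  .orderBy("spent")

spentByPayee.show(100, false)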

It is a type of sentiment analysis for myself so I know where I am wasting
my money most :)

Cheers













Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 13:37, ayan guha <gu...@gmail.com> wrote:

> Hi
>
> here is a quick setup (Based on airlines.txt dataset):
>
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
> *from datetime import datetime, timedelta*
> *from pyspark.sql.types import **
> *from pyspark.sql.functions import udf, col,rank,min*
> *from pyspark import SparkContext, HiveContext*
> *import sys*
> *from pyspark.sql import Window*
>
>
> *sc = SparkContext()*
> *hc = HiveContext(sc)*
> *customSchema = StructType([ \*
> *StructField("airport_id", IntegerType(), True) , \*
> *StructField("name", StringType(), True) , \*
> *StructField("city", StringType(), True) , \*
> *StructField("country", StringType(), True) , \*
> *StructField("iata", StringType(), True) , \*
> *StructField("icao", StringType(), True) , \*
> *StructField("latitude", DecimalType(precision=20,scale=10), True) , \*
> *StructField("longitude",DecimalType(precision=20,scale=10), True) , \*
> *StructField("altitude", IntegerType(), True) , \*
> *StructField("timezone", DoubleType(), True) , \*
> *StructField("dst", StringType(), True) , \*
> *StructField("tz_name", StringType(), True)*
> *])*
>
> *inFile = sys.argv[1]*
>
> *df1 = df =
> hc.read.format('com.databricks.spark.csv').options(header='false',
> inferschema='true').load(inFile,schema=customSchema)*
>
>
> *df1.registerTempTable("airlines")*
> *df2 = hc.sql("select airport_id,altitude,r from (select *,rank() over
> (order by altitude desc) r from airlines where altitude>100) rs where r=1")*
> *print df2.take(10)*
>
> *w = Window.orderBy(df['altitude'].desc())*
>
> *df3 = df1.filter(df1.altitude >
> 100).select(df1.airport_id,df1.altitude,rank().over(w).alias("r")).filter("r=1")*
> *print df3.take(10)*
>
> *sc.stop()*
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Here
> <https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html>is
> an awesome blog from Databricks.
>
> HTH....
>
> Ayan
>
>
>
> On Sun, Jul 31, 2016 at 8:58 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> It is true that whatever an analytic function does can be done
>> by standard SQL, with join and sub-queries. But the same routine done by
>> analytic function is always faster, or at least as fast, when compared to
>> standard SQL.
>>
>> I will try to see if I can do analytic functions with Spark FP on Data
>> Frames. It is essentially replacing the base table with DF and using JAVA
>> functions instead of SQL ones on top
>>
>> Also some text based search functions say LIKE in SQL can be replaced
>> with CONTAINS in FP.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 July 2016 at 10:56, ayan guha <gu...@gmail.com> wrote:
>>
>>> The point is, window functions are supposed designed to be faster by
>>> doing the calculations in one pass, instead of 2 pass in case of max.
>>>
>>> DF supports window functions (using sql.Window) so instead of writing
>>> sql, you can use it as well.
>>>
>>> Best
>>> Ayan
>>>
>>> On Sun, Jul 31, 2016 at 7:48 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> yes reserved word issue thanks
>>>>
>>>> hive> select *
>>>>     > from (select transactiondate, transactiondescription, debitamount
>>>>     > , rank() over (order by transactiondate desc) r
>>>>     > from accounts.ll_18740868 where transactiondescription like
>>>> '%HARRODS%'
>>>>     >  ) RS
>>>>     > where r=1
>>>>     > ;
>>>> Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
>>>> Total jobs = 1
>>>> Launching Job 1 out of 1
>>>> In order to change the average load for a reducer (in bytes):
>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>> In order to limit the maximum number of reducers:
>>>>   set hive.exec.reducers.max=<number>
>>>> In order to set a constant number of reducers:
>>>>   set mapreduce.job.reduces=<number>
>>>> Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
>>>> Query Hive on Spark job[0] stages:
>>>> 0
>>>> 1
>>>> Status: Running (Hive on Spark job[0])
>>>> Job Progress Format
>>>> CurrentTime StageId_StageAttemptId:
>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>>> [StageCost]
>>>> 2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
>>>> 2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
>>>> 2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
>>>> 2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
>>>> 2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
>>>> Status: Finished successfully in 10.10 seconds
>>>> OK
>>>> 2015-12-15      HARRODS LTD CD 4636     10.95   1
>>>> Time taken: 46.546 seconds, Fetched: 1 row(s)
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 31 July 2016 at 10:36, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> I think the word "INNER" is reserved in Hive. Please change the alias
>>>>> to something else.
>>>>>
>>>>> Not sure about scala, but essentially it is string replacement.
>>>>>
>>>>> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> thanks how about scala?
>>>>>>
>>>>>> BTW the same analytic code fails in Hive itself:(
>>>>>>
>>>>>> hive> select *
>>>>>>     > from (select transactiondate, transactiondescription,
>>>>>> debitamount
>>>>>>     > from (select transactiondate, transactiondescription,
>>>>>> debitamount
>>>>>>     > , rank() over (order by transactiondate desc) r
>>>>>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>     >      ) inner
>>>>>>     > where r=1
>>>>>>     > ;
>>>>>>
>>>>>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>>>>>         at
>>>>>> org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>         at
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>         at
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>>>> *FAILED: ParseException line 6:7 Failed to recognize predicate
>>>>>> 'inner'. Failed rule: 'identifier' in subquery source*
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> This is because Spark does  not provide a way to "bind" variables
>>>>>>> like Oracle does.
>>>>>>>
>>>>>>> So you can build the sql string, like below (in python)
>>>>>>>
>>>>>>> val = 'XYZ'
>>>>>>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Ayan.
>>>>>>>>
>>>>>>>> This is the one I used
>>>>>>>>
>>>>>>>> scala> sqltext = """
>>>>>>>>      |  select *
>>>>>>>>      | from (select transactiondate, transactiondescription,
>>>>>>>> debitamount
>>>>>>>>      | , rank() over (order by transactiondate desc) r
>>>>>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>>      |       ) inner
>>>>>>>>      |  where r=1
>>>>>>>>      |    """
>>>>>>>>
>>>>>>>> scala> HiveContext.sql(sqltext).show
>>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>>
>>>>>>>> The issue I see is that in SQL here I cannot pass HASHTAG as a
>>>>>>>> variable to SQL. For example in RDBMS I can do this
>>>>>>>>
>>>>>>>> 1> declare @pattern varchar(50)
>>>>>>>> 2> set @pattern = 'Direct'
>>>>>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>>>>>> '%'||@pattern||'%'
>>>>>>>> 4> go
>>>>>>>> (1 row affected)
>>>>>>>>  CHANNEL_DESC
>>>>>>>>  --------------------
>>>>>>>>  Direct Sales
>>>>>>>>
>>>>>>>> but not in Hive or Spark SQL
>>>>>>>>
>>>>>>>> whereas with FP it does it implicitly.
>>>>>>>>
>>>>>>>> col("CHANNELS").contains(HASHTAG))
>>>>>>>>
>>>>>>>> Unless there is a way of doing it?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> select *
>>>>>>>>> from (select *,
>>>>>>>>>              rank() over (order by transactiondate) r
>>>>>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>>>>>       ) inner
>>>>>>>>> where r=1
>>>>>>>>>
>>>>>>>>> Hi Mitch,
>>>>>>>>>
>>>>>>>>> If using SQL is fine, you can try the code above. You need to
>>>>>>>>> register ll_18740868  as temp table.
>>>>>>>>>
>>>>>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I would like to find out when it was the last time I paid a
>>>>>>>>>> company with Debit Card
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is the way I do it.
>>>>>>>>>>
>>>>>>>>>> 1) Find the date when I paid last
>>>>>>>>>> 2) Find the rest of details from the row(s)
>>>>>>>>>>
>>>>>>>>>> So
>>>>>>>>>>
>>>>>>>>>> var HASHTAG = "XYZ"
>>>>>>>>>> scala> var maxdate =
>>>>>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>>>>>
>>>>>>>>>> OK so it was 2015-12-15
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Now I want to get the rest of the columns. This one works when I
>>>>>>>>>> hard code the maxdate!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>>>>>> "transactiondescription", "debitamount").show
>>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>>
>>>>>>>>>> Now if I want to use the var maxdate in place of "2015-12-15",
>>>>>>>>>> how would I do that?
>>>>>>>>>>
>>>>>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>>>>>
>>>>>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>>>>>> [2015-12-15]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: How to filter based on a constant value

Posted by ayan guha <gu...@gmail.com>.
Hi

here is a quick setup (Based on airlines.txt dataset):

--------------------------------------------------------------------------------------------------------------------------------------------------------------
from datetime import datetime, timedelta
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col, rank, min
from pyspark import SparkContext, HiveContext
import sys
from pyspark.sql import Window


sc = SparkContext()
hc = HiveContext(sc)
customSchema = StructType([ \
StructField("airport_id", IntegerType(), True), \
StructField("name", StringType(), True), \
StructField("city", StringType(), True), \
StructField("country", StringType(), True), \
StructField("iata", StringType(), True), \
StructField("icao", StringType(), True), \
StructField("latitude", DecimalType(precision=20, scale=10), True), \
StructField("longitude", DecimalType(precision=20, scale=10), True), \
StructField("altitude", IntegerType(), True), \
StructField("timezone", DoubleType(), True), \
StructField("dst", StringType(), True), \
StructField("tz_name", StringType(), True)
])

inFile = sys.argv[1]

df1 = df = hc.read.format('com.databricks.spark.csv').options(header='false', inferschema='true').load(inFile, schema=customSchema)


df1.registerTempTable("airlines")
df2 = hc.sql("select airport_id,altitude,r from (select *,rank() over (order by altitude desc) r from airlines where altitude>100) rs where r=1")
print df2.take(10)

w = Window.orderBy(df['altitude'].desc())

df3 = df1.filter(df1.altitude > 100).select(df1.airport_id, df1.altitude, rank().over(w).alias("r")).filter("r=1")
print df3.take(10)

sc.stop()
--------------------------------------------------------------------------------------------------------------------------------------------------------------
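
For anyone following the thread in Scala rather than pyspark, the schema
definition and CSV load above might look roughly like the sketch below. This
is an illustration only, not something run for this thread: hc is assumed to
be a HiveContext, inFile a path variable, and the column list is trimmed.

import org.apache.spark.sql.types._

// Explicit schema, so the csv reader does not have to infer the types
val customSchema = StructType(Seq(
  StructField("airport_id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("city", StringType, nullable = true),
  StructField("country", StringType, nullable = true),
  StructField("altitude", IntegerType, nullable = true)
))

val df1 = hc.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(customSchema)
  .load(inFile)

df1.registerTempTable("airlines")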

Here is an awesome blog from Databricks:
<https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html>

HTH....

Ayan



On Sun, Jul 31, 2016 at 8:58 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> It is true that whatever an analytic function does can be done with
> standard SQL, using joins and sub-queries. But the same routine done with
> an analytic function is always faster, or at least as fast, when compared
> to standard SQL.
>
> I will try to see if I can do analytic functions with Spark FP on
> DataFrames. Essentially that means replacing the base table with a DF and
> using the DataFrame API functions instead of the SQL ones on top.
>
> Also, some text-based search predicates, such as LIKE in SQL, can be
> replaced with contains() in FP.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 10:56, ayan guha <gu...@gmail.com> wrote:
>
>> The point is, window functions are designed to be faster by doing the
>> calculations in one pass, instead of the two passes needed in the case of max.
>>
>> DF supports window functions (using sql.Window) so instead of writing
>> sql, you can use it as well.
>>
>> Best
>> Ayan
>>
>> On Sun, Jul 31, 2016 at 7:48 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> yes reserved word issue thanks
>>>
>>> hive> select *
>>>     > from (select transactiondate, transactiondescription, debitamount
>>>     > , rank() over (order by transactiondate desc) r
>>>     > from accounts.ll_18740868 where transactiondescription like
>>> '%HARRODS%'
>>>     >  ) RS
>>>     > where r=1
>>>     > ;
>>> Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
>>> Total jobs = 1
>>> Launching Job 1 out of 1
>>> In order to change the average load for a reducer (in bytes):
>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>> In order to limit the maximum number of reducers:
>>>   set hive.exec.reducers.max=<number>
>>> In order to set a constant number of reducers:
>>>   set mapreduce.job.reduces=<number>
>>> Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
>>> Query Hive on Spark job[0] stages:
>>> 0
>>> 1
>>> Status: Running (Hive on Spark job[0])
>>> Job Progress Format
>>> CurrentTime StageId_StageAttemptId:
>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>> [StageCost]
>>> 2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
>>> 2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
>>> 2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
>>> 2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
>>> 2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
>>> Status: Finished successfully in 10.10 seconds
>>> OK
>>> 2015-12-15      HARRODS LTD CD 4636     10.95   1
>>> Time taken: 46.546 seconds, Fetched: 1 row(s)
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 31 July 2016 at 10:36, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> I think the word "INNER" is reserved in Hive. Please change the alias
>>>> to something else.
>>>>
>>>> Not sure about scala, but essentially it is string replacement.
>>>>
>>>> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> thanks how about scala?
>>>>>
>>>>> BTW the same analytic code fails in Hive itself:(
>>>>>
>>>>> hive> select *
>>>>>     > from (select transactiondate, transactiondescription, debitamount
>>>>>     > from (select transactiondate, transactiondescription, debitamount
>>>>>     > , rank() over (order by transactiondate desc) r
>>>>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>     >      ) inner
>>>>>     > where r=1
>>>>>     > ;
>>>>>
>>>>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>>>>         at
>>>>> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>>>>         at
>>>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>>>         at
>>>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>>>         at
>>>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>>>>         at
>>>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>>>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>>>>         at
>>>>> org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>         at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>>> *FAILED: ParseException line 6:7 Failed to recognize predicate
>>>>> 'inner'. Failed rule: 'identifier' in subquery source*
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> This is because Spark does  not provide a way to "bind" variables
>>>>>> like Oracle does.
>>>>>>
>>>>>> So you can build the sql string, like below (in python)
>>>>>>
>>>>>> val = 'XYZ'
>>>>>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Ayan.
>>>>>>>
>>>>>>> This is the one I used
>>>>>>>
>>>>>>> scala> sqltext = """
>>>>>>>      |  select *
>>>>>>>      | from (select transactiondate, transactiondescription,
>>>>>>> debitamount
>>>>>>>      | , rank() over (order by transactiondate desc) r
>>>>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>      |       ) inner
>>>>>>>      |  where r=1
>>>>>>>      |    """
>>>>>>>
>>>>>>> scala> HiveContext.sql(sqltext).show
>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>>>>> +---------------+----------------------+-----------+---+
>>>>>>>
>>>>>>> The issue I see is that in SQL here I cannot pass HASHTAG as a
>>>>>>> variable to SQL. For example in RDBMS I can do this
>>>>>>>
>>>>>>> 1> declare @pattern varchar(50)
>>>>>>> 2> set @pattern = 'Direct'
>>>>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>>>>> '%'||@pattern||'%'
>>>>>>> 4> go
>>>>>>> (1 row affected)
>>>>>>>  CHANNEL_DESC
>>>>>>>  --------------------
>>>>>>>  Direct Sales
>>>>>>>
>>>>>>> but not in Hive or Spark SQL
>>>>>>>
>>>>>>> whereas with FP it does it implicitly.
>>>>>>>
>>>>>>> col("CHANNELS").contains(HASHTAG))
>>>>>>>
>>>>>>> Unless there is a way of doing it?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> select *
>>>>>>>> from (select *,
>>>>>>>>              rank() over (order by transactiondate) r
>>>>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>>>>       ) inner
>>>>>>>> where r=1
>>>>>>>>
>>>>>>>> Hi Mitch,
>>>>>>>>
>>>>>>>> If using SQL is fine, you can try the code above. You need to
>>>>>>>> register ll_18740868  as temp table.
>>>>>>>>
>>>>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I would like to find out when it was the last time I paid a
>>>>>>>>> company with Debit Card
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is the way I do it.
>>>>>>>>>
>>>>>>>>> 1) Find the date when I paid last
>>>>>>>>> 2) Find the rest of details from the row(s)
>>>>>>>>>
>>>>>>>>> So
>>>>>>>>>
>>>>>>>>> var HASHTAG = "XYZ"
>>>>>>>>> scala> var maxdate =
>>>>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>>>>
>>>>>>>>> OK so it was 2015-12-15
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now I want to get the rest of the columns. This one works when I
>>>>>>>>> hard code the maxdate!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>>>>> "transactiondescription", "debitamount").show
>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>>
>>>>>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>>>>>> would I do that?
>>>>>>>>>
>>>>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>>>>
>>>>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>>>>> [2015-12-15]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
It is true that whatever an analytic function does can be done with standard
SQL, using joins and sub-queries. But the same routine done with an analytic
function is always faster, or at least as fast, when compared to standard
SQL.

I will try to see if I can do analytic functions with Spark FP on
DataFrames. Essentially that means replacing the base table with a DF and
using the DataFrame API functions instead of the SQL ones on top.

Also, some text-based search predicates, such as LIKE in SQL, can be
replaced with contains() in FP.
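
As an illustration of the two routes, here is a minimal Scala sketch, assuming
the spark-shell session from earlier in the thread (the ll_18740868 DataFrame,
the HASHTAG variable and the HiveContext instance are already defined); it has
not been run as part of this thread:

import org.apache.spark.sql.functions.col

// SQL route: interpolate the search pattern into the statement text,
// then run it against a registered temp table
ll_18740868.registerTempTable("ll_18740868")
val sqltext =
  s"""select transactiondate, transactiondescription, debitamount
      from ll_18740868
      where transactiondescription like '%${HASHTAG}%'"""
HiveContext.sql(sqltext).show

// FP route: the same predicate expressed directly on the DataFrame
ll_18740868.filter(col("transactiondescription").contains(HASHTAG))
  .select("transactiondate", "transactiondescription", "debitamount")
  .show

The interpolated string gives the SQL route the same flexibility as the
contains() call, at the cost of building the statement text by hand.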

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 10:56, ayan guha <gu...@gmail.com> wrote:

> The point is, window functions are designed to be faster by doing the
> calculations in one pass, instead of the two passes needed in the case of max.
>
> DF supports window functions (using sql.Window) so instead of writing sql,
> you can use it as well.
>
> Best
> Ayan
>
> On Sun, Jul 31, 2016 at 7:48 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> yes reserved word issue thanks
>>
>> hive> select *
>>     > from (select transactiondate, transactiondescription, debitamount
>>     > , rank() over (order by transactiondate desc) r
>>     > from accounts.ll_18740868 where transactiondescription like
>> '%HARRODS%'
>>     >  ) RS
>>     > where r=1
>>     > ;
>> Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
>> Total jobs = 1
>> Launching Job 1 out of 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapreduce.job.reduces=<number>
>> Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
>> Query Hive on Spark job[0] stages:
>> 0
>> 1
>> Status: Running (Hive on Spark job[0])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>> 2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
>> 2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
>> 2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
>> 2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
>> 2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
>> Status: Finished successfully in 10.10 seconds
>> OK
>> 2015-12-15      HARRODS LTD CD 4636     10.95   1
>> Time taken: 46.546 seconds, Fetched: 1 row(s)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 July 2016 at 10:36, ayan guha <gu...@gmail.com> wrote:
>>
>>> I think the word "INNER" is reserved in Hive. Please change the alias to
>>> something else.
>>>
>>> Not sure about scala, but essentially it is string replacement.
>>>
>>> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> thanks how about scala?
>>>>
>>>> BTW the same analytic code fails in Hive itself:(
>>>>
>>>> hive> select *
>>>>     > from (select transactiondate, transactiondescription, debitamount
>>>>     > from (select transactiondate, transactiondescription, debitamount
>>>>     > , rank() over (order by transactiondate desc) r
>>>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>>>     >      ) inner
>>>>     > where r=1
>>>>     > ;
>>>>
>>>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>>>         at
>>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>>>         at
>>>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>>>         at
>>>> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>>>         at
>>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>>         at
>>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>>         at
>>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>>>         at
>>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>         at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>> *FAILED: ParseException line 6:7 Failed to recognize predicate 'inner'.
>>>> Failed rule: 'identifier' in subquery source*
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> This is because Spark does  not provide a way to "bind" variables like
>>>>> Oracle does.
>>>>>
>>>>> So you can build the sql string, like below (in python)
>>>>>
>>>>> val = 'XYZ'
>>>>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ayan.
>>>>>>
>>>>>> This is the one I used
>>>>>>
>>>>>> scala> sqltext = """
>>>>>>      |  select *
>>>>>>      | from (select transactiondate, transactiondescription,
>>>>>> debitamount
>>>>>>      | , rank() over (order by transactiondate desc) r
>>>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>      |       ) inner
>>>>>>      |  where r=1
>>>>>>      |    """
>>>>>>
>>>>>> scala> HiveContext.sql(sqltext).show
>>>>>> +---------------+----------------------+-----------+---+
>>>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>>>> +---------------+----------------------+-----------+---+
>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>>>> +---------------+----------------------+-----------+---+
>>>>>>
>>>>>> The issue I see is that in SQL here I cannot pass HASHTAG as a
>>>>>> variable to SQL. For example in RDBMS I can do this
>>>>>>
>>>>>> 1> declare @pattern varchar(50)
>>>>>> 2> set @pattern = 'Direct'
>>>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>>>> '%'||@pattern||'%'
>>>>>> 4> go
>>>>>> (1 row affected)
>>>>>>  CHANNEL_DESC
>>>>>>  --------------------
>>>>>>  Direct Sales
>>>>>>
>>>>>> but not in Hive or Spark SQL
>>>>>>
>>>>>> whereas with FP it does it implicitly.
>>>>>>
>>>>>> col("CHANNELS").contains(HASHTAG))
>>>>>>
>>>>>> Unless there is a way of doing it?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>>>>
>>>>>>> select *
>>>>>>> from (select *,
>>>>>>>              rank() over (order by transactiondate) r
>>>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>>>       ) inner
>>>>>>> where r=1
>>>>>>>
>>>>>>> Hi Mitch,
>>>>>>>
>>>>>>> If using SQL is fine, you can try the code above. You need to
>>>>>>> register ll_18740868  as temp table.
>>>>>>>
>>>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I would like to find out when it was the last time I paid a company
>>>>>>>> with Debit Card
>>>>>>>>
>>>>>>>>
>>>>>>>> This is the way I do it.
>>>>>>>>
>>>>>>>> 1) Find the date when I paid last
>>>>>>>> 2) Find the rest of details from the row(s)
>>>>>>>>
>>>>>>>> So
>>>>>>>>
>>>>>>>> var HASHTAG = "XYZ"
>>>>>>>> scala> var maxdate =
>>>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>>>
>>>>>>>> OK so it was 2015-12-15
>>>>>>>>
>>>>>>>>
>>>>>>>> Now I want to get the rest of the columns. This one works when I
>>>>>>>> hard code the maxdate!
>>>>>>>>
>>>>>>>>
>>>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>>>> "transactiondescription", "debitamount").show
>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>>>> +---------------+----------------------+-----------+
>>>>>>>>
>>>>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>>>>> would I do that?
>>>>>>>>
>>>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>>>
>>>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>>>> [2015-12-15]
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: How to filter based on a constant value

Posted by ayan guha <gu...@gmail.com>.
The point is, window functions are designed to be faster by doing the
calculations in one pass, instead of the two passes needed in the case of max.

DF supports window functions (using sql.Window) so instead of writing sql,
you can use it as well.
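
A rough Scala equivalent of that one-pass approach, assuming the ll_18740868
DataFrame and the HASHTAG variable from earlier in the thread (a sketch only,
not verified here):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// One pass: rank the matching rows by transactiondate descending
// (no partitionBy, so all matching rows go through a single partition,
// which is fine for a small filtered result)
val w = Window.orderBy(col("transactiondate").desc)

ll_18740868
  .filter(col("transactiondescription").contains(HASHTAG))
  .withColumn("r", rank().over(w))
  .filter(col("r") === 1)
  .select("transactiondate", "transactiondescription", "debitamount")
  .show

This keeps the filter and the ranking in a single DataFrame pipeline, along
the same lines as the df3 example earlier in the thread.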

Best
Ayan

On Sun, Jul 31, 2016 at 7:48 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> yes reserved word issue thanks
>
> hive> select *
>     > from (select transactiondate, transactiondescription, debitamount
>     > , rank() over (order by transactiondate desc) r
>     > from accounts.ll_18740868 where transactiondescription like
> '%HARRODS%'
>     >  ) RS
>     > where r=1
>     > ;
> Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
> Query Hive on Spark job[0] stages:
> 0
> 1
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
> 2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
> 2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
> 2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
> 2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
> Status: Finished successfully in 10.10 seconds
> OK
> 2015-12-15      HARRODS LTD CD 4636     10.95   1
> Time taken: 46.546 seconds, Fetched: 1 row(s)
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 10:36, ayan guha <gu...@gmail.com> wrote:
>
>> I think the word "INNER" is reserved in Hive. Please change the alias to
>> something else.
>>
>> Not sure about scala, but essentially it is string replacement.
>>
>> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> thanks how about scala?
>>>
>>> BTW the same analytic code fails in Hive itself:(
>>>
>>> hive> select *
>>>     > from (select transactiondate, transactiondescription, debitamount
>>>     > from (select transactiondate, transactiondescription, debitamount
>>>     > , rank() over (order by transactiondate desc) r
>>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>>     >      ) inner
>>>     > where r=1
>>>     > ;
>>>
>>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>>         at
>>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>> *FAILED: ParseException line 6:7 Failed to recognize predicate 'inner'.
>>> Failed rule: 'identifier' in subquery source*
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> This is because Spark does  not provide a way to "bind" variables like
>>>> Oracle does.
>>>>
>>>> So you can build the sql string, like below (in python)
>>>>
>>>> val = 'XYZ'
>>>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>>>>
>>>>
>>>>
>>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Thanks Ayan.
>>>>>
>>>>> This is the one I used
>>>>>
>>>>> scala> sqltext = """
>>>>>      |  select *
>>>>>      | from (select transactiondate, transactiondescription,
>>>>> debitamount
>>>>>      | , rank() over (order by transactiondate desc) r
>>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>      |       ) inner
>>>>>      |  where r=1
>>>>>      |    """
>>>>>
>>>>> scala> HiveContext.sql(sqltext).show
>>>>> +---------------+----------------------+-----------+---+
>>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>>> +---------------+----------------------+-----------+---+
>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>>> +---------------+----------------------+-----------+---+
>>>>>
>>>>> The issue I see is that in SQL here I cannot pass HASHTAG as a
>>>>> variable to SQL. For example in RDBMS I can do this
>>>>>
>>>>> 1> declare @pattern varchar(50)
>>>>> 2> set @pattern = 'Direct'
>>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>>> '%'||@pattern||'%'
>>>>> 4> go
>>>>> (1 row affected)
>>>>>  CHANNEL_DESC
>>>>>  --------------------
>>>>>  Direct Sales
>>>>>
>>>>> but not in Hive or Spark SQL
>>>>>
>>>>> whereas with FP it does it implicitly.
>>>>>
>>>>> col("CHANNELS").contains(HASHTAG))
>>>>>
>>>>> Unless there is a way of doing it?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>>>
>>>>>> select *
>>>>>> from (select *,
>>>>>>              rank() over (order by transactiondate) r
>>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>>       ) inner
>>>>>> where r=1
>>>>>>
>>>>>> Hi Mitch,
>>>>>>
>>>>>> If using SQL is fine, you can try the code above. You need to
>>>>>> register ll_18740868  as temp table.
>>>>>>
>>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I would like to find out when it was the last time I paid a company
>>>>>>> with Debit Card
>>>>>>>
>>>>>>>
>>>>>>> This is the way I do it.
>>>>>>>
>>>>>>> 1) Find the date when I paid last
>>>>>>> 2) Find the rest of details from the row(s)
>>>>>>>
>>>>>>> So
>>>>>>>
>>>>>>> var HASHTAG = "XYZ"
>>>>>>> scala> var maxdate =
>>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>>
>>>>>>> OK so it was 2015-12-15
>>>>>>>
>>>>>>>
>>>>>>> Now I want to get the rest of the columns. This one works when I
>>>>>>> hard code the maxdate!
>>>>>>>
>>>>>>>
>>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>>> "transactiondescription", "debitamount").show
>>>>>>> +---------------+----------------------+-----------+
>>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>>> +---------------+----------------------+-----------+
>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>>> +---------------+----------------------+-----------+
>>>>>>>
>>>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>>>> would I do that?
>>>>>>>
>>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>>
>>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>>> [2015-12-15]
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
Yes, it was the reserved word issue. Thanks.

hive> select *
    > from (select transactiondate, transactiondescription, debitamount
    > , rank() over (order by transactiondate desc) r
    > from accounts.ll_18740868 where transactiondescription like
'%HARRODS%'
    >  ) RS
    > where r=1
    > ;
Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
Query Hive on Spark job[0] stages:
0
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Status: Finished successfully in 10.10 seconds
OK
2015-12-15      HARRODS LTD CD 4636     10.95   1
Time taken: 46.546 seconds, Fetched: 1 row(s)
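
For reference, the same query can also be run from the Spark shell with the subquery alias renamed and the search pattern spliced in from a Scala variable. This is only a sketch of the approach discussed in this thread, assuming the HiveContext used earlier can see the same accounts.ll_18740868 table:

scala> val HASHTAG = "HARRODS"
scala> val sqltext = s"""
     |  select *
     |  from (select transactiondate, transactiondescription, debitamount
     |        , rank() over (order by transactiondate desc) r
     |        from accounts.ll_18740868 where transactiondescription like '%${HASHTAG}%'
     |       ) rs
     |  where r = 1
     | """
scala> HiveContext.sql(sqltext).show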

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 10:36, ayan guha <gu...@gmail.com> wrote:

> I think the word "INNER" is reserved in Hive. Please change the alias to
> something else.
>
> Not sure about scala, but essentially it is string replacement.
>
> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> thanks how about scala?
>>
>> BTW the same analytic code fails in Hive itself:(
>>
>> hive> select *
>>     > from (select transactiondate, transactiondescription, debitamount
>>     > from (select transactiondate, transactiondescription, debitamount
>>     > , rank() over (order by transactiondate desc) r
>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>     >      ) inner
>>     > where r=1
>>     > ;
>>
>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>         at
>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>         at
>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>         at
>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>         at
>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>         at
>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>         at
>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>         at
>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>         at
>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>> *FAILED: ParseException line 6:7 Failed to recognize predicate 'inner'.
>> Failed rule: 'identifier' in subquery source*
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> This is because Spark does  not provide a way to "bind" variables like
>>> Oracle does.
>>>
>>> So you can build the sql string, like below (in python)
>>>
>>> val = 'XYZ'
>>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>>>
>>>
>>>
>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Thanks Ayan.
>>>>
>>>> This is the one I used
>>>>
>>>> scala> sqltext = """
>>>>      |  select *
>>>>      | from (select transactiondate, transactiondescription, debitamount
>>>>      | , rank() over (order by transactiondate desc) r
>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>      |       ) inner
>>>>      |  where r=1
>>>>      |    """
>>>>
>>>> scala> HiveContext.sql(sqltext).show
>>>> +---------------+----------------------+-----------+---+
>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>> +---------------+----------------------+-----------+---+
>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>> +---------------+----------------------+-----------+---+
>>>>
>>>> The issue I see is that in SQL here I cannot pass HASHTAG as a variable
>>>> to SQL. For example in RDBMS I can do this
>>>>
>>>> 1> declare @pattern varchar(50)
>>>> 2> set @pattern = 'Direct'
>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>> '%'||@pattern||'%'
>>>> 4> go
>>>> (1 row affected)
>>>>  CHANNEL_DESC
>>>>  --------------------
>>>>  Direct Sales
>>>>
>>>> but not in Hive or Spark SQL
>>>>
>>>> whereas with FP it does it implicitly.
>>>>
>>>> col("CHANNELS").contains(HASHTAG))
>>>>
>>>> Unless there is a way of doing it?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> select *
>>>>> from (select *,
>>>>>              rank() over (order by transactiondate) r
>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>       ) inner
>>>>> where r=1
>>>>>
>>>>> Hi Mitch,
>>>>>
>>>>> If using SQL is fine, you can try the code above. You need to register
>>>>> ll_18740868  as temp table.
>>>>>
>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I would like to find out when it was the last time I paid a company
>>>>>> with Debit Card
>>>>>>
>>>>>>
>>>>>> This is the way I do it.
>>>>>>
>>>>>> 1) Find the date when I paid last
>>>>>> 2) Find the rest of details from the row(s)
>>>>>>
>>>>>> So
>>>>>>
>>>>>> var HASHTAG = "XYZ"
>>>>>> scala> var maxdate =
>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>
>>>>>> OK so it was 2015-12-15
>>>>>>
>>>>>>
>>>>>> Now I want to get the rest of the columns. This one works when I hard
>>>>>> code the maxdate!
>>>>>>
>>>>>>
>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>> "transactiondescription", "debitamount").show
>>>>>> +---------------+----------------------+-----------+
>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>> +---------------+----------------------+-----------+
>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>> +---------------+----------------------+-----------+
>>>>>>
>>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>>> would I do that?
>>>>>>
>>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>>
>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>> [2015-12-15]
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: How to filter based on a constant value

Posted by ayan guha <gu...@gmail.com>.
I think the word "INNER" is reserved in Hive. Please change the alias to
something else.

Not sure about Scala, but essentially it is the same string replacement.
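
In Scala the usual way to splice the variable in is string interpolation; a minimal sketch, assuming HASHTAG and the registered temp table ll_18740868 from earlier in the thread:

scala> val HASHTAG = "XYZ"
scala> // the ${HASHTAG} placeholder is filled in before the string reaches Spark SQL
scala> val sqltext = s"select transactiondate, transactiondescription, debitamount from ll_18740868 where transactiondescription like '%${HASHTAG}%'"
scala> HiveContext.sql(sqltext).show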

On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> thanks how about scala?
>
> BTW the same analytic code fails in Hive itself:(
>
> hive> select *
>     > from (select transactiondate, transactiondescription, debitamount
>     > from (select transactiondate, transactiondescription, debitamount
>     > , rank() over (order by transactiondate desc) r
>     > from ll_18740868 where transactiondescription like '%XYZ%'
>     >      ) inner
>     > where r=1
>     > ;
>
> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>         at
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>         at
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>         at
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>         at
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>         at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>         at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>         at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>         at
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> *FAILED: ParseException line 6:7 Failed to recognize predicate 'inner'.
> Failed rule: 'identifier' in subquery source*
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:
>
>> Hi
>>
>> This is because Spark does  not provide a way to "bind" variables like
>> Oracle does.
>>
>> So you can build the sql string, like below (in python)
>>
>> val = 'XYZ'
>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>>
>>
>>
>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Thanks Ayan.
>>>
>>> This is the one I used
>>>
>>> scala> sqltext = """
>>>      |  select *
>>>      | from (select transactiondate, transactiondescription, debitamount
>>>      | , rank() over (order by transactiondate desc) r
>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>      |       ) inner
>>>      |  where r=1
>>>      |    """
>>>
>>> scala> HiveContext.sql(sqltext).show
>>> +---------------+----------------------+-----------+---+
>>> |transactiondate|transactiondescription|debitamount|  r|
>>> +---------------+----------------------+-----------+---+
>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>> +---------------+----------------------+-----------+---+
>>>
>>> The issue I see is that in SQL here I cannot pass HASHTAG as a variable
>>> to SQL. For example in RDBMS I can do this
>>>
>>> 1> declare @pattern varchar(50)
>>> 2> set @pattern = 'Direct'
>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>> '%'||@pattern||'%'
>>> 4> go
>>> (1 row affected)
>>>  CHANNEL_DESC
>>>  --------------------
>>>  Direct Sales
>>>
>>> but not in Hive or Spark SQL
>>>
>>> whereas with FP it does it implicitly.
>>>
>>> col("CHANNELS").contains(HASHTAG))
>>>
>>> Unless there is a way of doing it?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> select *
>>>> from (select *,
>>>>              rank() over (order by transactiondate) r
>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>       ) inner
>>>> where r=1
>>>>
>>>> Hi Mitch,
>>>>
>>>> If using SQL is fine, you can try the code above. You need to register
>>>> ll_18740868  as temp table.
>>>>
>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I would like to find out when it was the last time I paid a company
>>>>> with Debit Card
>>>>>
>>>>>
>>>>> This is the way I do it.
>>>>>
>>>>> 1) Find the date when I paid last
>>>>> 2) Find the rest of details from the row(s)
>>>>>
>>>>> So
>>>>>
>>>>> var HASHTAG = "XYZ"
>>>>> scala> var maxdate =
>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>
>>>>> OK so it was 2015-12-15
>>>>>
>>>>>
>>>>> Now I want to get the rest of the columns. This one works when I hard
>>>>> code the maxdate!
>>>>>
>>>>>
>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>> "transactiondescription", "debitamount").show
>>>>> +---------------+----------------------+-----------+
>>>>> |transactiondate|transactiondescription|debitamount|
>>>>> +---------------+----------------------+-----------+
>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>> +---------------+----------------------+-----------+
>>>>>
>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>> would I do that?
>>>>>
>>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>>
>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>> [2015-12-15]
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks. How about Scala?

BTW the same analytic code fails in Hive itself:(

hive> select *
    > from (select transactiondate, transactiondescription, debitamount
    > from (select transactiondate, transactiondescription, debitamount
    > , rank() over (order by transactiondate desc) r
    > from ll_18740868 where transactiondescription like '%XYZ%'
    >      ) inner
    > where r=1
    > ;
FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
        at
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
        at
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
        at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
        at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
        at
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
        at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
        at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
        at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
*FAILED: ParseException line 6:7 Failed to recognize predicate 'inner'.
Failed rule: 'identifier' in subquery source*


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 10:21, ayan guha <gu...@gmail.com> wrote:

> Hi
>
> This is because Spark does  not provide a way to "bind" variables like
> Oracle does.
>
> So you can build the sql string, like below (in python)
>
> val = 'XYZ'
> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
>
>
>
> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Thanks Ayan.
>>
>> This is the one I used
>>
>> scala> sqltext = """
>>      |  select *
>>      | from (select transactiondate, transactiondescription, debitamount
>>      | , rank() over (order by transactiondate desc) r
>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>      |       ) inner
>>      |  where r=1
>>      |    """
>>
>> scala> HiveContext.sql(sqltext).show
>> +---------------+----------------------+-----------+---+
>> |transactiondate|transactiondescription|debitamount|  r|
>> +---------------+----------------------+-----------+---+
>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>> +---------------+----------------------+-----------+---+
>>
>> The issue I see is that in SQL here I cannot pass HASHTAG as a variable
>> to SQL. For example in RDBMS I can do this
>>
>> 1> declare @pattern varchar(50)
>> 2> set @pattern = 'Direct'
>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>> '%'||@pattern||'%'
>> 4> go
>> (1 row affected)
>>  CHANNEL_DESC
>>  --------------------
>>  Direct Sales
>>
>> but not in Hive or Spark SQL
>>
>> whereas with FP it does it implicitly.
>>
>> col("CHANNELS").contains(HASHTAG))
>>
>> Unless there is a way of doing it?
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>>
>>> select *
>>> from (select *,
>>>              rank() over (order by transactiondate) r
>>>        from ll_18740868 where transactiondescription='XYZ'
>>>       ) inner
>>> where r=1
>>>
>>> Hi Mitch,
>>>
>>> If using SQL is fine, you can try the code above. You need to register
>>> ll_18740868  as temp table.
>>>
>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I would like to find out when it was the last time I paid a company
>>>> with Debit Card
>>>>
>>>>
>>>> This is the way I do it.
>>>>
>>>> 1) Find the date when I paid last
>>>> 2) Find the rest of details from the row(s)
>>>>
>>>> So
>>>>
>>>> var HASHTAG = "XYZ"
>>>> scala> var maxdate =
>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>
>>>> OK so it was 2015-12-15
>>>>
>>>>
>>>> Now I want to get the rest of the columns. This one works when I hard
>>>> code the maxdate!
>>>>
>>>>
>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>> "transactiondescription", "debitamount").show
>>>> +---------------+----------------------+-----------+
>>>> |transactiondate|transactiondescription|debitamount|
>>>> +---------------+----------------------+-----------+
>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>> +---------------+----------------------+-----------+
>>>>
>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>> would I do that?
>>>>
>>>> I tried lit(maxdate) etc but they are all giving me error?
>>>>
>>>> java.lang.RuntimeException: Unsupported literal type class
>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>> [2015-12-15]
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: How to filter based on a constant value

Posted by ayan guha <gu...@gmail.com>.
Hi

This is because Spark does not provide a way to "bind" variables like
Oracle does.

So you can build the SQL string, like below (in Python):

val = 'XYZ'
sqlbase = "select ..... where col = '<val>'".replace('<val>,val)



On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Thanks Ayan.
>
> This is the one I used
>
> scala> sqltext = """
>      |  select *
>      | from (select transactiondate, transactiondescription, debitamount
>      | , rank() over (order by transactiondate desc) r
>      | from ll_18740868 where transactiondescription like '%XYZ%'
>      |       ) inner
>      |  where r=1
>      |    """
>
> scala> HiveContext.sql(sqltext).show
> +---------------+----------------------+-----------+---+
> |transactiondate|transactiondescription|debitamount|  r|
> +---------------+----------------------+-----------+---+
> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
> +---------------+----------------------+-----------+---+
>
> The issue I see is that in SQL here I cannot pass HASHTAG as a variable to
> SQL. For example in RDBMS I can do this
>
> 1> declare @pattern varchar(50)
> 2> set @pattern = 'Direct'
> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
> '%'||@pattern||'%'
> 4> go
> (1 row affected)
>  CHANNEL_DESC
>  --------------------
>  Direct Sales
>
> but not in Hive or Spark SQL
>
> whereas with FP it does it implicitly.
>
> col("CHANNELS").contains(HASHTAG))
>
> Unless there is a way of doing it?
>
> Thanks
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:
>
>> select *
>> from (select *,
>>              rank() over (order by transactiondate) r
>>        from ll_18740868 where transactiondescription='XYZ'
>>       ) inner
>> where r=1
>>
>> Hi Mitch,
>>
>> If using SQL is fine, you can try the code above. You need to register
>> ll_18740868  as temp table.
>>
>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> I would like to find out when it was the last time I paid a company with
>>> Debit Card
>>>
>>>
>>> This is the way I do it.
>>>
>>> 1) Find the date when I paid last
>>> 2) Find the rest of details from the row(s)
>>>
>>> So
>>>
>>> var HASHTAG = "XYZ"
>>> scala> var maxdate =
>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>
>>> OK so it was 2015-12-15
>>>
>>>
>>> Now I want to get the rest of the columns. This one works when I hard
>>> code the maxdate!
>>>
>>>
>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>> "transactiondescription", "debitamount").show
>>> +---------------+----------------------+-----------+
>>> |transactiondate|transactiondescription|debitamount|
>>> +---------------+----------------------+-----------+
>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>> +---------------+----------------------+-----------+
>>>
>>> Now if I want to use the var maxdate in place of "2015-12-15", how would
>>> I do that?
>>>
>>> I tried lit(maxdate) etc but they are all giving me error?
>>>
>>> java.lang.RuntimeException: Unsupported literal type class
>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>> [2015-12-15]
>>>
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: How to filter based on a constant value

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Ayan.

This is the one I used

scala> sqltext = """
     |  select *
     | from (select transactiondate, transactiondescription, debitamount
     | , rank() over (order by transactiondate desc) r
     | from ll_18740868 where transactiondescription like '%XYZ%'
     |       ) inner
     |  where r=1
     |    """

scala> HiveContext.sql(sqltext).show
+---------------+----------------------+-----------+---+
|transactiondate|transactiondescription|debitamount|  r|
+---------------+----------------------+-----------+---+
|     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
+---------------+----------------------+-----------+---+

The issue I see is that here I cannot pass HASHTAG as a variable into the
SQL text. For example, in an RDBMS I can do this:

1> declare @pattern varchar(50)
2> set @pattern = 'Direct'
3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
'%'||@pattern||'%'
4> go
(1 row affected)
 CHANNEL_DESC
 --------------------
 Direct Sales

but not in Hive or Spark SQL,

whereas with FP (the DataFrame API) the Scala variable is picked up implicitly:

col("CHANNELS").contains(HASHTAG)

Unless there is a way of doing it in SQL that I am missing?

Thanks
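
For completeness, the whole lookup can also stay in the DataFrame (FP) API, where the Scala variable is captured directly and no SQL string needs to be built. A sketch using a window function, assuming HASHTAG and ll_18740868 as defined earlier in the thread:

scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> val w = Window.orderBy(col("transactiondate").desc)
scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).
     |   withColumn("r", rank().over(w)).    // rank transactions, newest first
     |   filter(col("r") === 1).             // keep only the most recent one
     |   select("transactiondate", "transactiondescription", "debitamount").
     |   show

With no partitionBy the window moves all matching rows to a single partition, which is acceptable for a small lookup like this.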














Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 July 2016 at 01:20, ayan guha <gu...@gmail.com> wrote:

> select *
> from (select *,
>              rank() over (order by transactiondate) r
>        from ll_18740868 where transactiondescription='XYZ'
>       ) inner
> where r=1
>
> Hi Mitch,
>
> If using SQL is fine, you can try the code above. You need to register
> ll_18740868  as temp table.
>
> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I would like to find out when it was the last time I paid a company with
>> Debit Card
>>
>>
>> This is the way I do it.
>>
>> 1) Find the date when I paid last
>> 2) Find the rest of details from the row(s)
>>
>> So
>>
>> var HASHTAG = "XYZ"
>> scala> var maxdate =
>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>
>> OK so it was 2015-12-15
>>
>>
>> Now I want to get the rest of the columns. This one works when I hard
>> code the maxdate!
>>
>>
>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>> "transactiondescription", "debitamount").show
>> +---------------+----------------------+-----------+
>> |transactiondate|transactiondescription|debitamount|
>> +---------------+----------------------+-----------+
>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>> +---------------+----------------------+-----------+
>>
>> Now if I want to use the var maxdate in place of "2015-12-15", how would
>> I do that?
>>
>> I tried lit(maxdate) etc but they are all giving me error?
>>
>> java.lang.RuntimeException: Unsupported literal type class
>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>> [2015-12-15]
>>
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: How to filter based on a constant value

Posted by ayan guha <gu...@gmail.com>.
select *
from (select *,
             rank() over (order by transactiondate) r
       from ll_18740868 where transactiondescription='XYZ'
      ) inner
where r=1

Hi Mich,

If using SQL is fine, you can try the code above. You need to register
ll_18740868 as a temp table.
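
A minimal sketch of that registration step, assuming the Spark 1.6-style HiveContext used elsewhere in the thread (on Spark 2.x, createOrReplaceTempView plays the same role):

scala> ll_18740868.registerTempTable("ll_18740868")   // expose the DataFrame to SQL under this name
scala> HiveContext.sql("select count(1) from ll_18740868").show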

On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

>
> Hi,
>
> I would like to find out when it was the last time I paid a company with
> Debit Card
>
>
> This is the way I do it.
>
> 1) Find the date when I paid last
> 2) Find the rest of details from the row(s)
>
> So
>
> var HASHTAG = "XYZ"
> scala> var maxdate =
> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>
> OK so it was 2015-12-15
>
>
> Now I want to get the rest of the columns. This one works when I hard code
> the maxdate!
>
>
> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
> && col("transactiondate") === "2015-12-15").select("transactiondate",
> "transactiondescription", "debitamount").show
> +---------------+----------------------+-----------+
> |transactiondate|transactiondescription|debitamount|
> +---------------+----------------------+-----------+
> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
> +---------------+----------------------+-----------+
>
> Now if I want to use the var maxdate in place of "2015-12-15", how would I
> do that?
>
> I tried lit(maxdate) etc but they are all giving me error?
>
> java.lang.RuntimeException: Unsupported literal type class
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> [2015-12-15]
>
>
> Thanks
>



-- 
Best Regards,
Ayan Guha