Posted to dev@griffin.apache.org by William Guo <gu...@outlook.com> on 2017/09/30 01:23:39 UTC

Re: Griffin Accuracy Doubts

hi Vidyaranya,


Please check my comments inline.


Thanks,

William


________________________________
From: Vidyaranya Malladi <vi...@nielsen.com>
Sent: Friday, September 29, 2017 10:06 PM
To: guoyp@apache.org; guoiii@outlook.com
Cc: dev@griffin.incubator.apache.org; Mara Preotescu; Abishek Kunduru; Ananthanarayanan Ms
Subject: Griffin Accuracy Doubts

Dear William,
 Good day. Please let us know about the questions below.

1. Does Griffin plan to support multiple tables as the source, e.g. a source table built from a query joining tbl1, tbl2, ..., tblN with a few selected columns, and similarly for the destination?

--Yes, since we have changed our backend to fully support SQL.

2. Would Griffin support reporting the result metric column-wise instead of an overall matched count?
e.g. if the src table has 4 cols and the dest has 4 or more cols, and the user wants to check the accuracy of col4, which is 'price', in src and destination by adding a new col to the src table, '%change', computed as abs(src_price - dest_price)/100

--Yes, since we have changed our backend to fully support SQL.

Fact_tag        fact_sDesc      fat_Ldesc       SRC Fact_precision      SRC Data        TGT Data        % Chg
f000000000000000000100000000000000000000        Units   Sales Units     1       25510.00        2551.00 0.00
f000000000000000000100000000000000000001        Units (any promo)       Sales Units (any promo) 1       5846.00 584.59  0.00
f000000000000000000100000000000000000002        Units (no promo)        Sales Units (no promo)  1       17172.00        1717.25 0.00


3. Support UDFs on the cols before comparing, e.g. one of our sources is a table in mainframe format where dots/decimals are dropped, and we have to apply logic to restore the decimals

--For UDFs: if the UDF is registered in your computing cluster, it should work from the command line. We are considering refactoring the UI to support a SQL editor directly,
but how to provide a good front-end user experience for UDFs is still under consideration.
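As a sketch of the UDF idea in question 3 (using SQLite's UDF registration purely for illustration; with Griffin the UDF would be registered in the Spark cluster instead, and the function and table names here are made up):

```python
import sqlite3

# Hypothetical UDF restoring the decimal point a mainframe-format extract
# dropped, assuming the last 2 digits are fractional: '2551000' -> 25510.0
def restore_decimals(raw):
    return int(raw) / 100

conn = sqlite3.connect(":memory:")
# Register the UDF with the engine so it can be used directly in SQL.
conn.create_function("restore_decimals", 1, restore_decimals)
conn.execute("CREATE TABLE src (fact_tag TEXT, raw_price TEXT)")
conn.execute("INSERT INTO src VALUES ('f0001', '2551000')")

# The UDF is then applied inside the comparison SQL itself.
price = conn.execute("SELECT restore_decimals(raw_price) FROM src").fetchone()[0]
print(price)  # 25510.0
```

The same pattern carries over to Spark: once the function is registered with the session, the accuracy SQL can call it on the source column before comparing.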

4. Last time we discussed a few features that were said to be under consideration; could you please let us know if there is a roadmap for them, e.g. advanced SQL features, group-by clauses, UDFs, etc.

--SQL support is already done; we will make a quick release after it is fully tested and reviewed.

5. Griffin compares the cols selected by using them in joins. Do we plan to enhance this so the user could select a few cols other than those mentioned in the joins, and compare the src col's value vs the dest col's value, say without the join and only with an order by clause?


select col from src order by prikey
select col from tgt order by prikey

i.e. compare src.col vs tgt.col instead of joining on a key. Please let us know.

-- For this case, it depends on how you would like to map src.col to tgt.col; we don't need a join key if you can express the mapping in SQL, since we fully support SQL.
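A sketch of the order-by pairing from question 5 (plain Python over SQLite here, with made-up table and column names): fetch both sides ordered by the primary key and compare positionally, with no join key involved.

```python
import sqlite3

# Question 5 as a sketch: compare src.col to tgt.col row-by-row, pairing rows
# by ORDER BY prikey instead of a join key. Tables and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (prikey INTEGER, col REAL);
CREATE TABLE tgt (prikey INTEGER, col REAL);
INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0);
INSERT INTO tgt VALUES (1, 10.0), (2, 21.0), (3, 30.0);
""")

src_rows = conn.execute("SELECT col FROM src ORDER BY prikey").fetchall()
tgt_rows = conn.execute("SELECT col FROM tgt ORDER BY prikey").fetchall()

# Positional comparison: row i of src vs row i of tgt.
matched = sum(1 for (s,), (t,) in zip(src_rows, tgt_rows) if s == t)
print(matched, len(src_rows))  # 2 of 3 rows match
```

In pure SQL the same pairing could be expressed with a window function such as row_number() over the ORDER BY, which is presumably what "elaborate the mapping in sql" amounts to here.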

Thank you very much.



Thanks & Regards
M. Vidyaranya
NDX
TCS - Nielsen Relationship
Mobile  : +91 9790591311

Re: Re: Re: Re: Re: Griffin Accuracy Doubts

Posted by Ananthanarayanan Ms <an...@nielsen.com>.
Hi Lionel,
Thank you. The griffin-dsl works for this use case, and yes, we are awaiting
the next release to cover our other use case: using more than one table
joined by a sql query as the src, vs again a multi-sliced table joined as
the tgt sql, and doing an accuracy check between them.


Regards,
Ananthanarayanan.M.S

On Thu, Nov 30, 2017 at 7:42 PM, Lionel Liu <bh...@163.com> wrote:

> Hi Ananthanarayanan.M.S,
>   Thanks for your reply. I think you're right: in this version, the
> griffin-dsl can deal with accuracy and profiling cases, but most users
> prefer sql.
>   Actually, in this version, to handle streaming mode, we added some
> specific processing for spark sql; it can deal with some profiling cases
> but fails in others. We'll fix this bug and try to pass sql directly to
> the measure engine, so you'll be able to use sql in the next version.
>
>
>
> --
> Regards,
> Lionel, Liu
>
> At 2017-11-29 12:07:39, "Ananthanarayanan Ms" <ananthanarayanan.ms.ap@
> nielsen.com> wrote:
>
> Hi Lionel,
>  Sure, will try griffin-dsl and update here. I was keeping my distance
> from the griffin-dsl feature because of its learning curve (how to use the
> expressions and attributes), and we are planning for end users who would
> not have this background; we would then need an intermediate step to
> translate their needs into this dsl, unlike sql, where they are already
> proficient.
>
> Thank you.
>
>
> Regards,
> Ananthanarayanan.M.S
>
> On Tue, Nov 28, 2017 at 8:38 AM, Lionel Liu <bh...@163.com> wrote:
>
>> Hi Ananthanarayanan.M.S,
>> Sorry for the late reply. I think you are trying the accuracy measure by
>> configuring the field "dsl.type" as "spark-sql", not as "griffin-dsl",
>> right?
>> If you just want accuracy, you'd better try "griffin-dsl": configure
>> "dq.type" as "accuracy", with a "rule" like "source.name =
>> target.name and source.age = target.age".
>> For the "spark-sql" type, we add a column "__tmst" into the sql and group
>> by it, to fit some streaming cases, but it does need some background
>> knowledge. For example, in the accuracy case, when there are multiple data
>> sources, the sql engine cannot tell whether the column "__tmst" belongs to
>> the source or the target, so it is ambiguous; we'll fix this bug in a
>> later version.
>> In short, for accuracy cases, we recommend a rule configuration like this:
>> {
>>   "dsl.type": "griffin-dsl",
>>   "dq.type": "accuracy",
>>   "rule": "source.name = target.name and source.age = target.age"
>> }
>>
>>
>> Hope this can help you, thanks a lot.
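As a sketch of what that griffin-dsl accuracy rule computes (using SQLite and made-up `name`/`age` data; Griffin itself runs the equivalent on Spark): count the source rows with no matching target row, then derive the matched ratio.

```python
import sqlite3

# Accuracy semantics behind the rule
# "source.name = target.name and source.age = target.age":
# source rows that find no match in target are the misses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (name TEXT, age INTEGER);
CREATE TABLE target (name TEXT, age INTEGER);
INSERT INTO source VALUES ('alice', 30), ('bob', 25), ('carol', 41);
INSERT INTO target VALUES ('alice', 30), ('bob', 26);
""")

miss = conn.execute("""
    SELECT count(*) FROM source LEFT JOIN target
    ON source.name = target.name AND source.age = target.age
    WHERE target.name IS NULL AND target.age IS NULL
""").fetchone()[0]
total = conn.execute("SELECT count(*) FROM source").fetchone()[0]
print(miss, total)  # 2 misses ('bob' age differs, 'carol' absent) out of 3
```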
>>
>>
>>
>> --
>> Regards,
>> Lionel, Liu
>>
>> At 2017-11-21 20:42:45, "Ananthanarayanan Ms" <
>> ananthanarayanan.ms.ap@nielsen.com> wrote:
>>
>> Hi Lionel,
>>  Even after the change, the issue still remains. It looks like during the
>> join both src and tgt have already been populated with the column
>> *__tmst*, hence it reports *'__tmst' is ambiguous*. Could you please help
>> to check and clarify.
>>
>> 17/11/21 08:38:46 ERROR engine.SparkSqlEngine: run spark sql [ SELECT
>> `__tmst` AS `__tmst`, source.product_id, source.market_id,
>> source.period_id,source.s5_volume_rom_distrib FROM source LEFT JOIN
>> target *ON *coalesce(source.period_id, 'null') =
>> coalesce(target.period_id, 'null') and coalesce(source.market_id, 'null') =
>> coalesce(target.market_id, 'null') and coalesce(source.product_id, 'null')
>> = coalesce(target.product_id, 'null') *WHERE *(NOT (source.product_id IS
>> NULL AND source.market_id IS NULL AND source. period_id IS NULL)) AND
>> target.product_id IS NULL AND target.market_id IS NULL AND target.period_id
>> IS NULL GROUP BY `__tmst` ] error: Reference '__tmst' is ambiguous, could
>> be: __tmst#145L, __tmst#146L.; line 1 pos 566
>>
>>
>>
>> Regards,
>> Ananthanarayanan.M.S
>>
>> On Tue, Nov 21, 2017 at 8:05 AM, Lionel Liu <bh...@163.com> wrote:
>>
>>> Hi Ananthanarayanan.M.S,
>>>
>>> Actually, we add a new column `__tmst` for each table, to persist the
>>> timestamp of each data row.
>>> But the sql you list doesn't look like our exact logic, so I'd like to
>>> see your accuracy rule statement to check it.
>>>
>>> if your accuracy rule statement is like:
>>> "source.product_id = target.product_id, source.market_id =
>>> target.market_id, source.period_id = target.period_id",
>>> the sql for that step should be:
>>> "SELECT source.* FROM source LEFT JOIN target ON
>>> coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null')
>>> and coalesce(source.market_id, 'null') = coalesce(target.market_id,
>>> 'null') and coalesce(source.period_id, 'null') = coalesce(target.period_id,
>>> 'null') WHERE (NOT (source.product_id IS NULL AND source.market_id IS
>>> NULL AND source.period_id IS NULL)) AND target.product_id IS NULL AND
>>> target.market_id IS NULL AND target.period_id IS NULL".
>>>
>>> Hope this can help you, thanks.
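The coalesce(..., 'null') wrapping in the generated sql above exists because plain `=` never matches NULL keys; a minimal illustration (SQLite, toy data):

```python
import sqlite3

# Why join keys are wrapped in coalesce(..., 'null'): with plain equality,
# a NULL key on both sides would still count as a miss.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (k TEXT);
CREATE TABLE tgt (k TEXT);
INSERT INTO src VALUES ('a'), (NULL);
INSERT INTO tgt VALUES ('a'), (NULL);
""")

plain = conn.execute(
    "SELECT count(*) FROM src JOIN tgt ON src.k = tgt.k").fetchone()[0]
nullsafe = conn.execute(
    "SELECT count(*) FROM src JOIN tgt "
    "ON coalesce(src.k, 'null') = coalesce(tgt.k, 'null')").fetchone()[0]
print(plain, nullsafe)  # the plain join misses the NULL pair
```

One caveat of this trick: a real key value equal to the literal string 'null' would collide with an actual NULL, so the sentinel must not occur in the data.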
>>>
>>>
>>> --
>>> Regards,
>>> Lionel, Liu
>>>
>>>
>>> At 2017-11-21 06:50:09, "William Guo" <gu...@apache.org> wrote:
>>> >hello Ananthanarayanan.M.S,
>>> >
>>> >We support some partitioned tables, could you show us your partitioned table description for check and some sample data will be helpful?
>>> >
>>> >Thanks,
>>> >William
>>> >________________________________
>>> >From: Ananthanarayanan Ms <an...@nielsen.com>
>>> >Sent: Tuesday, November 21, 2017 3:53 AM
>>> >To: William Guo
>>> >Cc: dev@griffin.incubator.apache.org; Abishek Kunduru
>>> >Subject: Re: Re: Griffin Accuracy Doubts
>>> >
>>> >Hello William / Lionel,
>Could you please let us know if we can use 1.6, and whether partitioned tables alone can be used as src/tgt, as we get the error below when we build and run the master code.
>>> >
>>> >17/11/20 14:51:13 ERROR engine.SparkSqlEngine: run spark sql [ SELECT `__tmst` AS `__tmst`, source.product_id, source.market_id, source.period_id,s5_volume_rom_distrib FROM source LEFT JOIN target ON coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null') and coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null') and coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null') WHERE (NOT (source.product_id IS NULL )) GROUP BY `__tmst` ] error: Reference '__tmst' is ambiguous, could be: __tmst#145L, __tmst#146L.; line 1 pos 413
>>> >
>>> >
>>> >
>>> >Regards,
>>> >Ananthanarayanan.M.S
>>> >
>>> >On Wed, Oct 4, 2017 at 2:39 PM, William Guo <gu...@apache.org>> wrote:
>>> >
>>> >hi Abishek,
>>> >
>>> >
>For SQL support, the version should be 0.1.6; we will release it in the middle of October.
>>> >
>>> >
>>> >
>>> >Thanks,
>>> >
>>> >William
>>> >
>>> >
>>> >________________________________
>>> >From: Abishek Kunduru <ab...@nielsen.com>>
>>> >Sent: Tuesday, October 3, 2017 12:35 AM
>>> >To: 刘力力
>>> >Cc: dev@griffin.incubator.apache.org<ma...@griffin.incubator.apache.org>; Vidyaranya Malladi; guoyp@apache.org<ma...@apache.org>; Mara Preotescu; Ananthanarayanan Ms
>>> >Subject: Re: Re: Griffin Accuracy Doubts
>>> >
>Thanks William and Lionel for the quick response. Can you please tell us when the Griffin release with SQL support will be out, and its version number?
>>> >
>Regarding the second question, Vidya is not referring to changing the source or target data, but rather to a DQ feature that compares source and target columns and outputs the difference as a percentage change (Vidya gave an example in the table below, which shows the difference between source and target numeric columns).
>>> >
>>> >Best,
>>> >Abishek
>>> >
>>> >
>>> >On Fri, Sep 29, 2017 at 10:13 PM, 刘力力 <bh...@163.com>> wrote:
>>> >Hi Vidyaranya,
>>> >
>I have some more comments in addition to William's answers. Please check the comments below.
>>> >
>>> >
>>> >1. Does Griffin plan to support multi table being src eg. src table frame from a qry joined thro tabl1,tabl2,tabln and selected few cols and then finally use that as src and similarly for destination as well
>>> >
>>> >--Yes, Since we have change our backend to fully support sql.
>>> >
>>> >2. Would Griffin support the result metric col wise instead of overall matched count.
>>> >eg if src table is having 4 cols and dest is having 4 or more cols and user wants to check the accuracy of col4 which is 'price' in src and destination by having a new col added to the src table as '%change' which would be abs(src_price - dest_price)/100
>>> >
>>> >--Yes, Since we have change our backend to fully support sql.
>>> >
>>> >Fact_tag        fact_sDesc      fat_Ldesc       SRC Fact_precision      SRC Data        TGT Data        % Chg
>>> >f000000000000000000100000000000000000000        Units   Sales Units     1       25510.00        2551.00 0.00
>>> >f000000000000000000100000000000000000001        Units (any promo)       Sales Units (any promo) 1       5846.00 584.59  0.00
>>> >f000000000000000000100000000000000000002        Units (no promo)        Sales Units (no promo)  1       17172.00        1717.25 0.00
>>> >
>>> >
>--[comment from Lionel] Actually, I think modifying the data source is more of an ETL task than a DQ concern. We consider that the DQ result from griffin should be simple, like a value, a percentage, or a group of them; that is better for a DQ check.
>>> >
>>> >
>>> >3. Support udfs on the cols before comparing, eg one of our src is a table from mainframe format where dots/decimals are ignore and we have to apply a logic to get the decimals
>>> >
>>> >--For UDFs, if it registered in your computing cluster, it should be fine from command line. we are considering to refactor ui to support sql editor directly.
>>> >but how to provide better user experience from front end for UDF, we are still considering it.
>>> >
>>> >4. Last time we discussed few features were said to be discussed to support could you please let us know if we have any road map for the same. like advanced sql features and adding groupby clause/udfs etc
>>> >
>>> >--Support SQL is already done, will have a quick release after we fully tested and reviewed.
>>> >
>>> >5. Griffin compares the cols selected by using them in joins, do we plan to enhance to check if the user could select few cols other than the mentioned in joins and we need to compare the src col's value vs dest col's value lets say without the join and only by order by clause.
>>> >
>>> >
>>> >select col from src order by prikey
>>> >select col from tgt order by prikey
>>> >
>>> >src.col vs tgt.col instead of the join key . Please let us know.
>>> >
>>> >-- For this case, it depends on how would you like to map src.col with tgt.col, we don't need join key if you can elaborate mapping in sql since we are fully supporting sql.
>>> >
>--[comment from Lionel] We support sql directly; if you can describe your case in sql, we can do it.
>>> >
>>> >Thanks very much.
>>> >
>>> >
>>> >--
>>> >Regards,
>>> >Lionel, Liu
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>


Re:Re: Re: Re: Griffin Accuracy Doubts

Posted by Lionel Liu <bh...@163.com>.
Hi Ananthanarayanan.M.S,
Sorry for the late reply, I think you are trying accuracy measure by configuring the field "dsl.type" as "spark-sql", not as "griffin-dsl", right?
If you just want to use accuracy, you'd better try "griffin-dsl", and configure "dq.type" as "accuracy", with "rule" like this "source.name = target.name and source.age = target.age".
For "spark-sql" type, we add a column "__tmst" into the sql, and group by it, to fit some streaming cases, but it does need some background knowledge, for example, like accuracy case, when there're multiple data sources, sql engine can not recognize the column "__tmst" belongs to source or target, so it is ambiguous, we'll fix this bug in later version.
After all, for accuracy cases, we recommend you use the configuration of rule like this:
{
  "dsl.type": "griffin-dsl",
  "dq.type": "accuracy",
  "rule": "source.name = target.name and source.age = target.age"
}
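
For reference, the ambiguity described above is easy to reproduce outside Griffin (a minimal sketch; SQLite stands in for Spark SQL, and the tables are made up): when both registered tables carry a `__tmst` column, an unqualified reference fails, while a qualified one works.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source (__tmst INTEGER, name TEXT);
CREATE TABLE target (__tmst INTEGER, name TEXT);
INSERT INTO source VALUES (1, 'a');
INSERT INTO target VALUES (1, 'a');
""")

# Unqualified: both tables provide __tmst, so the engine rejects the query.
try:
    con.execute("SELECT __tmst FROM source LEFT JOIN target ON source.name = target.name")
except sqlite3.OperationalError as e:
    print(e)  # ambiguous column name: __tmst

# Qualified: naming the table removes the ambiguity.
rows = con.execute(
    "SELECT source.__tmst FROM source LEFT JOIN target ON source.name = target.name"
).fetchall()
print(rows)  # [(1,)]
```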




Hope this can help you, thanks a lot.




--

Regards,
Lionel, Liu

At 2017-11-21 20:42:45, "Ananthanarayanan Ms" <an...@nielsen.com> wrote:

Hi Lionel,
 Even after changing, the issue still remains. It looks like, during the join, both src and tgt have already been populated with the __tmst column, hence '__tmst' is ambiguous. Could you please help check and clarify.


17/11/21 08:38:46 ERROR engine.SparkSqlEngine: run spark sql [ SELECT `__tmst` AS `__tmst`, source.product_id, source.market_id, source.period_id,source.s5_volume_rom_distrib FROM source LEFT JOIN target ON coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null') and coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null') and coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null') WHERE (NOT (source.product_id IS NULL AND source.market_id IS NULL AND source. period_id IS NULL)) AND target.product_id IS NULL AND target.market_id IS NULL AND target.period_id IS NULL GROUP BY `__tmst` ] error: Reference '__tmst' is ambiguous, could be: __tmst#145L, __tmst#146L.; line 1 pos 566
 




Regards,
Ananthanarayanan.M.S


On Tue, Nov 21, 2017 at 8:05 AM, Lionel Liu <bh...@163.com> wrote:

Hi Ananthanarayanan.M.S,


Actually, we add a new column `__tmst` to each table, to persist the timestamp of each data row.
But the sql you listed doesn't look like our exact logic, so I'd like to see your accuracy rule statement to check.


if your accuracy rule statement is like:
"source.product_id = target.product_id, source.market_id = target.market_id, source.period_id = target.period_id",
the sql for that step should be:
"SELECT source.* FROM source LEFT JOIN target ON coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null') and coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null') and coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null') WHERE (NOT (source.product_id IS NULL AND source.market_id IS NULL AND source.period_id IS NULL)) AND target.product_id IS NULL AND target.market_id IS NULL AND target.period_id IS NULL".
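
This null-safe left-join pattern can be exercised on a tiny dataset (a sketch; SQLite stands in for Spark SQL, and the data is invented): the query returns the source rows with no match in the target on the three keys, i.e. the miss records.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source (product_id TEXT, market_id TEXT, period_id TEXT);
CREATE TABLE target (product_id TEXT, market_id TEXT, period_id TEXT);
INSERT INTO source VALUES ('p1','m1','t1'), ('p2','m1','t1'), ('p3','m2','t1');
INSERT INTO target VALUES ('p1','m1','t1'), ('p2','m1','t1');
""")

# Miss records: source rows whose join keys find no target row.
# coalesce(..., 'null') makes NULL keys compare equal on both sides.
miss = con.execute("""
SELECT source.* FROM source LEFT JOIN target
  ON  coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null')
  AND coalesce(source.market_id, 'null')  = coalesce(target.market_id, 'null')
  AND coalesce(source.period_id, 'null')  = coalesce(target.period_id, 'null')
WHERE NOT (source.product_id IS NULL AND source.market_id IS NULL AND source.period_id IS NULL)
  AND target.product_id IS NULL AND target.market_id IS NULL AND target.period_id IS NULL
""").fetchall()
print(miss)  # [('p3', 'm2', 't1')]
```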


Hope this can help you, thanks.



--

Regards,
Lionel, Liu



At 2017-11-21 06:50:09, "William Guo" <gu...@apache.org> wrote:
>hello Ananthanarayanan.M.S,
>
>We do support partitioned tables. Could you show us your partitioned table description so we can check? Some sample data would also be helpful.
>
>Thanks,
>William
>________________________________
>From: Ananthanarayanan Ms <an...@nielsen.com>
>Sent: Tuesday, November 21, 2017 3:53 AM
>To: William Guo
>Cc: dev@griffin.incubator.apache.org; Abishek Kunduru
>Subject: Re: Re: Griffin Accuracy Doubts
>
>Hello William / Lionel,
>Could you please let us know if we can use 1.6? Also, could you please let us know whether partitioned tables alone can be used as src/tgt? The error below occurs when we build and run the master code.
>
>17/11/20 14:51:13 ERROR engine.SparkSqlEngine: run spark sql [ SELECT `__tmst` AS `__tmst`, source.product_id, source.market_id, source.period_id,s5_volume_rom_distrib FROM source LEFT JOIN target ON coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null') and coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null') and coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null') WHERE (NOT (source.product_id IS NULL )) GROUP BY `__tmst` ] error: Reference '__tmst' is ambiguous, could be: __tmst#145L, __tmst#146L.; line 1 pos 413
>
>
>
>Regards,
>Ananthanarayanan.M.S
>
>On Wed, Oct 4, 2017 at 2:39 PM, William Guo <gu...@apache.org>> wrote:
>
>hi Abishek,
>
>
>For SQL support, the version should be 0.1.6; we will release it in the middle of October.
>
>
>
>Thanks,
>
>William
>
>
>________________________________
>From: Abishek Kunduru <ab...@nielsen.com>>
>Sent: Tuesday, October 3, 2017 12:35 AM
>To: 刘力力
>Cc: dev@griffin.incubator.apache.org<ma...@griffin.incubator.apache.org>; Vidyaranya Malladi; guoyp@apache.org<ma...@apache.org>; Mara Preotescu; Ananthanarayanan Ms
>Subject: Re: Re: Griffin Accuracy Doubts
>
>Thanks William and Lionel for the quick response. Can you please tell us when Griffin with SQL support will be released, and what its version number will be?
>
>Regarding the second question, Vidya is not referring to changing the source or target data, but to a DQ feature that compares source and target columns and outputs the difference as a percentage change (Vidya gave an example in the table below, which shows the difference between the source and target numeric columns).
>
>Best,
>Abishek
>
>
>On Fri, Sep 29, 2017 at 10:13 PM, 刘力力 <bh...@163.com>> wrote:
>Hi Vidyaranya,
>
>I have some more comments on the answers from William. Please check the comments below.
>
>
>1. Does Griffin plan to support multiple tables as the source, e.g. a source frame built from a query joining tabl1, tabl2, ..., tabln with a few selected columns, then used as the source, and similarly for the destination?
>
>--Yes, since we have changed our backend to fully support sql.
>
>2. Would Griffin support the result metric column-wise instead of an overall matched count?
>E.g. if the source table has 4 columns and the destination has 4 or more, and the user wants to check the accuracy of col4 ('price' in both source and destination) by adding a new column '%change' to the source table, computed as abs(src_price - dest_price)/100.
>
>--Yes, since we have changed our backend to fully support sql.
>
>Fact_tag        fact_sDesc      fat_Ldesc       SRC Fact_precision      SRC Data        TGT Data        % Chg
>f000000000000000000100000000000000000000        Units   Sales Units     1       25510.00        2551.00 0.00
>f000000000000000000100000000000000000001        Units (any promo)       Sales Units (any promo) 1       5846.00 584.59  0.00
>f000000000000000000100000000000000000002        Units (no promo)        Sales Units (no promo)  1       17172.00        1717.25 0.00
>
>
>--[comment from Lionel] Actually, I think modifying the data source is more of an ETL task than part of the DQ domain; we consider that the DQ result from griffin should be simple, like a value, a percentage, or a group of them, which might be better for a DQ check.
>
>
>3. Support udfs on the columns before comparing. E.g. one of our sources is a table in mainframe format where dots/decimals are dropped, and we have to apply logic to restore the decimals.
>
>--For UDFs, if they are registered in your computing cluster, it should work fine from the command line. We are considering refactoring the ui to support a sql editor directly,
>but we are still considering how to provide a better front-end user experience for UDFs.
>
>4. Last time we discussed this, a few features were said to be under consideration. Could you please let us know if there is a roadmap for them, e.g. advanced sql features, group-by clauses, udfs, etc.?
>
>--SQL support is already done; we will have a quick release after it is fully tested and reviewed.
>
>5. Griffin compares the selected columns by using them in joins. Do we plan to enhance this so that the user can select columns other than those used in the joins, and compare the source column's values against the destination column's values, say without the join and only with an order by clause?
>
>
>select col from src order by prikey
>select col from tgt order by prikey
>
>src.col vs tgt.col instead of the join key. Please let us know.
>
>-- For this case, it depends on how you would like to map src.col to tgt.col. We don't need a join key if you can express the mapping in sql, since we fully support sql.
>
>--[comment from Lionel] We support sql directly; if you can describe your case in sql, we can do it.
>
>Thanks very much.
>
>
>--
>Regards,
>Lionel, Liu
>
>
Re:Re: Re: Griffin Accuracy Doubts

Posted by Lionel Liu <bh...@163.com>.

Re: Re: Griffin Accuracy Doubts

Posted by William Guo <gu...@apache.org>.



Re: Re: Griffin Accuracy Doubts

Posted by William Guo <gu...@apache.org>.
hi Abishek,


For SQL support, the version should be 0.1.6, we will release in Middle of October.



Thanks,

William


________________________________
From: Abishek Kunduru <ab...@nielsen.com>
Sent: Tuesday, October 3, 2017 12:35 AM
To: 刘力力
Cc: dev@griffin.incubator.apache.org; Vidyaranya Malladi; guoyp@apache.org; Mara Preotescu; Ananthanarayanan Ms
Subject: Re: Re: Griffin Accuracy Doubts

Thanks William and Lionel for the quick response. Can you please tell us when would the Griffin with SQL support be released and its version number?

Regarding the second question, Vidya is not referring to changing the source or target data but instead DQ feature to compare source and target columns and output the difference in terms of percentage change (Vidya gave an example as well in the below table which outputs the difference between source and target numeric columns).

Best,
Abishek


On Fri, Sep 29, 2017 at 10:13 PM, 刘力力 <bh...@163.com>> wrote:
Hi Vidyaranya,

I have some more comments about the answers from William. Pls check the following comments.


1. Does Griffin plan to support multi table being src eg. src table frame from a qry joined thro tabl1,tabl2,tabln and selected few cols and then finally use that as src and similarly for destination as well

--Yes, Since we have change our backend to fully support sql.

2. Would Griffin support the result metric col wise instead of overall matched count.
eg if src table is having 4 cols and dest is having 4 or more cols and user wants to check the accuracy of col4 which is 'price' in src and destination by having a new col added to the src table as '%change' which would be abs(src_price - dest_price)/100

--Yes, Since we have change our backend to fully support sql.

Fact_tag        fact_sDesc      fat_Ldesc       SRC Fact_precision      SRC Data        TGT Data        % Chg
f000000000000000000100000000000000000000        Units   Sales Units     1       25510.00        2551.00 0.00
f000000000000000000100000000000000000001        Units (any promo)       Sales Units (any promo) 1       5846.00 584.59  0.00
f000000000000000000100000000000000000002        Units (no promo)        Sales Units (no promo)  1       17172.00        1717.25 0.00


--[comment from Lionel] Actually, I think modifying the data source is more of an ETL task than a DQ concern; we consider that the DQ result from Griffin should be simple, like a value, a percentage, or a group of them, which might be better for DQ checks.
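[Editor's note] A minimal sketch of the column-wise metric idea, using two sample rows from the table above. The %-change formula here (absolute difference relative to the source value) is only illustrative; it is not Griffin's definition, and the row labels are just the short descriptions from the example.

```python
# Hypothetical sketch of a column-wise DQ metric: instead of one overall
# matched count, emit a per-row percentage change between the source and
# target values of a single numeric column.
rows = [
    # (fact_short_desc, src_data, tgt_data) -- sample values from the table above
    ("Units", 25510.00, 2551.00),
    ("Units (any promo)", 5846.00, 584.59),
]

def pct_change(src, tgt):
    """Absolute difference as a percentage of the source value (illustrative)."""
    return abs(src - tgt) / src * 100.0

metrics = {desc: round(pct_change(s, t), 2) for desc, s, t in rows}
print(metrics)  # both sample rows differ from the source by 90%
```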


3. Support UDFs on the columns before comparing, e.g. one of our sources is a table in mainframe format where dots/decimal points are omitted, and we have to apply logic to restore the decimals.

--For UDFs, if they are registered in your computing cluster, they should work fine from the command line. We are considering refactoring the UI to support a SQL editor directly,
but we are still considering how to provide a better front-end user experience for UDFs.
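[Editor's note] A hypothetical sketch of the kind of UDF mentioned above: mainframe feeds often store numbers with an implied decimal point (e.g. "0000255100" with precision 2 meaning 2551.00). The function name, raw value, and sqlite3 registration below are illustrative only; on a Spark cluster the equivalent would be registered with `spark.udf.register` before being used in the comparison SQL.

```python
# Hypothetical UDF (illustrative, not Griffin code): re-insert the implied
# decimal point of a mainframe-format number before comparison. Registered
# as a SQL function in sqlite3 purely to show the register-then-use flow.
import sqlite3

def implied_decimal(raw, precision):
    """Interpret a digit string with `precision` implied decimal places."""
    return int(raw) / (10 ** precision)

conn = sqlite3.connect(":memory:")
conn.create_function("implied_decimal", 2, implied_decimal)

# Once registered, the UDF can be applied to a column inside the DQ SQL.
value = conn.execute("SELECT implied_decimal('0000255100', 2)").fetchone()[0]
print(value)  # 2551.0
```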

4. Last time we discussed a few features that were said to be under consideration; could you please let us know if there is a roadmap for them, like advanced SQL features, GROUP BY clauses, UDFs, etc.?

--SQL support is already done; we will have a quick release after it is fully tested and reviewed.

5. Griffin compares the selected columns by using them in joins. Do we plan an enhancement so the user could select columns other than those mentioned in the joins, and compare the source column's values against the destination column's values, say without the join and only via an ORDER BY clause?


select col from src order by prikey
select col from tgt order by prikey

i.e. src.col vs tgt.col instead of the join key. Please let us know.

-- For this case, it depends on how you would like to map src.col to tgt.col; we don't need a join key if you can express the mapping in SQL, since we fully support SQL.

--[comment from Lionel] We support SQL directly; if you can describe your case in SQL, we can do that.
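[Editor's note] One way to express question 5, as a hypothetical sketch: pair rows by sort order rather than by a join key, then compare src.col with tgt.col positionally. This assumes both sides sort deterministically on prikey and have the same row count; the tables and values below are made up.

```python
# Hypothetical sketch: positional comparison of two ordered result sets,
# with no join key. sqlite3 again stands in for the SQL backend.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (prikey INTEGER, col TEXT);
CREATE TABLE tgt (prikey INTEGER, col TEXT);
INSERT INTO src VALUES (1, 'x'), (2, 'y'), (3, 'z');
INSERT INTO tgt VALUES (1, 'x'), (2, 'Y'), (3, 'z');
""")

src_rows = conn.execute("SELECT col FROM src ORDER BY prikey").fetchall()
tgt_rows = conn.execute("SELECT col FROM tgt ORDER BY prikey").fetchall()

# Compare row i of src against row i of tgt: how many ranks agree?
matched = sum(1 for s, t in zip(src_rows, tgt_rows) if s == t)
print(matched, len(src_rows))  # 2 of 3 positions match
```

In pure SQL the same pairing could be done by numbering each side with a window function (e.g. ROW_NUMBER() OVER (ORDER BY prikey)) and joining on the row number.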

Thanks very much.


--
Regards,
Lionel, Liu


At 2017-09-30 08:23:39, "William Guo" <gu...@outlook.com> wrote:
>hi Vidyaranya,
>
>
>Please check my comments inline.
>
>
>Thanks,
>
>William
>
>
>________________________________
>From: Vidyaranya Malladi <vi...@nielsen.com>
>Sent: Friday, September 29, 2017 10:06 PM
>To: guoyp@apache.org<ma...@apache.org>; guoiii@outlook.com<ma...@outlook.com>
>Cc: dev@griffin.incubator.apache.org<ma...@griffin.incubator.apache.org>; Mara Preotescu; Abishek Kunduru; Ananthanarayanan Ms
>Subject: Griffin Accuracy Doubts
>
>Dear William,
> Good day. Please let us know on the below questions that we have
>
>1. Does Griffin plan to support multiple tables as the source, e.g. a source table built from a query joining tabl1, tabl2, ..., tabln and selecting a few columns, then finally using that as the source, and similarly for the destination?
>
>--Yes, since we have changed our backend to fully support SQL.
>
>2. Would Griffin support column-wise result metrics instead of an overall matched count?
>E.g. if the source table has 4 columns and the destination has 4 or more, and the user wants to check the accuracy of col4, 'price', in source and destination by adding a new column '%change' to the source table, computed as abs(src_price - dest_price)/100.
>
>--Yes, since we have changed our backend to fully support SQL.
>
>Fact_tag        fact_sDesc      fat_Ldesc       SRC Fact_precision      SRC Data        TGT Data        % Chg
>f000000000000000000100000000000000000000        Units   Sales Units     1       25510.00        2551.00 0.00
>f000000000000000000100000000000000000001        Units (any promo)       Sales Units (any promo) 1       5846.00 584.59  0.00
>f000000000000000000100000000000000000002        Units (no promo)        Sales Units (no promo)  1       17172.00        1717.25 0.00
>
>
>3. Support UDFs on the columns before comparing, e.g. one of our sources is a table in mainframe format where dots/decimal points are omitted, and we have to apply logic to restore the decimals.
>
>--For UDFs, if they are registered in your computing cluster, they should work fine from the command line. We are considering refactoring the UI to support a SQL editor directly,
>but we are still considering how to provide a better front-end user experience for UDFs.
>
>4. Last time we discussed a few features that were said to be under consideration; could you please let us know if there is a roadmap for them, like advanced SQL features, GROUP BY clauses, UDFs, etc.?
>
>--SQL support is already done; we will have a quick release after it is fully tested and reviewed.
>
>5. Griffin compares the selected columns by using them in joins. Do we plan an enhancement so the user could select columns other than those mentioned in the joins, and compare the source column's values against the destination column's values, say without the join and only via an ORDER BY clause?
>
>
>select col from src order by prikey
>select col from tgt order by prikey
>
>i.e. src.col vs tgt.col instead of the join key. Please let us know.
>
>-- For this case, it depends on how you would like to map src.col to tgt.col; we don't need a join key if you can express the mapping in SQL, since we fully support SQL.
>
>Thank you very much.
>
>
>
>Thanks & Regards
>M. Vidyaranya
>NDX
>TCS - Nielsen Relationship
>Mobile  : +91 9790591311




