Posted to dev@griffin.apache.org by "Chitral Verma (Jira)" <ji...@apache.org> on 2021/03/18 16:19:00 UTC

[jira] [Commented] (GRIFFIN-358) Rewrite the Rule/Measure implementations

    [ https://issues.apache.org/jira/browse/GRIFFIN-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304255#comment-17304255 ] 

Chitral Verma commented on GRIFFIN-358:
---------------------------------------

[~wankun] [~guoyp] Do you have any comments on this?

 

> Rewrite the Rule/Measure implementations
> ----------------------------------------
>
>                 Key: GRIFFIN-358
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-358
>             Project: Griffin
>          Issue Type: New Feature
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>
> Current `RuleParams` can be of the following 3 DSL types,
>  * Data Ops (for source preprocessing)
>  * Griffin DSL
>  * SparkSQL
> GriffinDSL allows the implementation of measures (DQ Types) like Completeness, Accuracy, etc.
> To enable such measures, there is an extensive implementation of expressions, task hierarchies, and parsing, most of which depends heavily on scala-parser-combinators.
> In the end, Griffin DSL tries to mimic a SparkSQL-like query, but with user-defined constraints substituted in.
> This approach has some drawbacks,
>  * Suboptimal processing. While the transformation steps execute in parallel on the driver, the data set is still scanned multiple times in parallel, which can cause inefficiencies on the SparkSession side, and the internal task scheduler was single-threaded. Even though the data set can be cached, it still branches, and crucial memory is spent holding the data set rather than processing it.
>  * Internal functions of Spark are not used. Data preprocessing currently has a very limited scope, even though hundreds of Spark SQL functions are available for use.
>  * This blocks structured streaming. The manually constructed SQL queries cause multiple aggregations in the same query on a streaming data set, which is not supported by Spark Structured Streaming. There are workarounds for this, but they all require rewriting the *Expr2DQSteps classes.
>  * Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure and SparkSQL are redundant functionalities.
> The proposed solution involves SparkSQL DSL based measures and some changes to Rule Params. This will enhance the data preprocessing flows and the measures themselves.
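
To illustrate the "scanned multiple times" drawback quoted above, here is a minimal sketch of how a completeness-style measure could be expressed as one plain SQL aggregation that reads the data set once for all columns. This is not Griffin's implementation or its proposed API: the function names (completeness_query, run_completeness), the table "source" and its columns are hypothetical, and Python's built-in sqlite3 is only a stand-in for a Spark SQL session so that the example runs without a cluster.

```python
import sqlite3


def completeness_query(table, columns):
    """Build ONE SQL statement that counts total rows and NULLs per column.

    Hypothetical helper: table and column names are assumed to be trusted
    identifiers (they are interpolated into the SQL text, not bound).
    """
    parts = ["COUNT(*) AS total"]
    for col in columns:
        # COALESCE keeps the result at 0 (not NULL) when the table is empty.
        parts.append(
            f"COALESCE(SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END), 0) "
            f"AS {col}_incomplete"
        )
    return f"SELECT {', '.join(parts)} FROM {table}"


def run_completeness(conn, table, columns):
    """Run the single-pass query and return complete/incomplete counts."""
    row = conn.execute(completeness_query(table, columns)).fetchone()
    total = row[0]
    result = {"total": total}
    for col, incomplete in zip(columns, row[1:]):
        result[col] = {"incomplete": incomplete, "complete": total - incomplete}
    return result


# Tiny in-memory data set standing in for a Griffin data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1, "a", "a@example.org"),
        (2, None, "b@example.org"),
        (3, "c", None),
        (4, None, None),
    ],
)

result = run_completeness(conn, "source", ["name", "email"])
print(result)
```

The design point is only that every per-column count lives in the same SELECT, so the engine needs a single scan and a single aggregation instead of one branch of the data set per constraint. Whether the actual rewrite takes this shape is up to the proposal.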



--
This message was sent by Atlassian Jira
(v8.3.4#803005)