Posted to dev@griffin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/10/08 09:18:00 UTC
[jira] [Work logged] (GRIFFIN-289) new feature for griffin COMPLETENESS dq type

     [ https://issues.apache.org/jira/browse/GRIFFIN-289?focusedWorklogId=324961&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324961 ]

ASF GitHub Bot logged work on GRIFFIN-289:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Oct/19 09:17
            Start Date: 08/Oct/19 09:17
    Worklog Time Spent: 10m 
      Work Description: LittleZhao commented on pull request #538: [GRIFFIN-289]New feature for griffin COMPLETENESS dq type
URL: https://github.com/apache/griffin/pull/538#discussion_r332411787
 
 

 ##########
 File path: measure/src/main/scala/org/apache/griffin/measure/step/builder/dsl/transform/CompletenessExpr2DQSteps.scala
 ##########
 @@ -167,4 +175,46 @@ case class CompletenessExpr2DQSteps(context: DQContext,
     }
   }
 
+  /**
+    * get 'error' where clause
+    * @param errorConfs error configuration list
+    * @return 'error' where clause
+    */
+  def getErrorConfCompleteWhereClause(errorConfs: Seq[RuleErrorConfParam]): String = {
+    errorConfs.map(errorConf => this.getEachErrorWhereClause(errorConf)).mkString(" OR ")
+  }
+
+  /**
+    * get error sql for each column
+    * @param errorConf  error configuration
+    * @return 'error' sql for each column
+    */
+  def getEachErrorWhereClause(errorConf: RuleErrorConfParam): String = {
+    val errorType: Option[String] = errorConf.getErrorType
+    val columnName: String = errorConf.getColumnName.get
+    if ("regex".equalsIgnoreCase(errorType.get)) {
+      // only have one regular expression
+      val regexValue: String = errorConf.getValues.apply(0)
+      val afterReplace: String = regexValue.replaceAll("""\\""", """\\\\""")
+      val result: String = s"`${columnName}` REGEXP '${afterReplace}'"
+      return result
+    } else if ("enumeration".equalsIgnoreCase(errorType.get)) {
+      val values: Seq[String] = errorConf.getValues
+      // hive_none means None
+      var hasNone: Boolean = false
+      if (values.contains("hive_none")) {
+        hasNone = true
+      }
+
+      val valueWithQuote: String = values.filter(value => !"hive_none".equals(value))
+        .map(value => s"'${value}'").mkString(", ")
 
 Review comment:
   @guoyuepeng  could you please close this PR, or is there anything I should do?
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 324961)
    Time Spent: 2h 40m  (was: 2.5h)

> new feature for griffin COMPLETENESS dq type
> --------------------------------------------
>
>                 Key: GRIFFIN-289
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-289
>             Project: Griffin
>          Issue Type: New Feature
>          Components: completeness-batch
>    Affects Versions: 0.3.1-incubating
>            Reporter: Zhao Li
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hello
>  
> We currently use the Griffin measure module to check batch data quality. For the COMPLETENESS dq type, Griffin counts the incomplete records in a table, but it only checks whether a column is null.
>  
> However, checking for null alone is not enough to decide whether a column value is invalid. In our case, analysts may consider other values invalid even though they are not null. For example, for a column named "company", a record is invalid if company is in ("a", "b", "c").
>  
> We need two ways for users to filter incomplete records. One is "enumeration": users list all the values they consider invalid for a column. The other is "regular expression": users write a regular expression that matches the invalid values for a column.
>  
> Could Griffin update the COMPLETENESS dq type to support these "enumeration" and "regular expression" ways of filtering incomplete records?
>  
> Regards
>  
> Zhao
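
For illustration only, the two filter modes requested above could map to a SQL WHERE fragment roughly as follows. This is a minimal sketch, not the code from the pull request: `ErrorConf`, `CompletenessSketch`, `eachClause` and `whereClause` are hypothetical names, the "email" column in the usage note is made up, and the sketch leaves out null handling and the escaping of quotes or backslashes in values.

```scala
// Hypothetical stand-in for a per-column error configuration; not Griffin's actual class.
final case class ErrorConf(column: String, errorType: String, values: Seq[String])

object CompletenessSketch {
  // Wrap a value in single quotes. No escaping is attempted: values are
  // assumed to contain no quotes or backslashes.
  private def quote(v: String): String = "'" + v + "'"

  // Build the predicate that marks one column's value as invalid.
  def eachClause(conf: ErrorConf): String = conf.errorType.toLowerCase match {
    case "regex" =>
      // Assumes exactly one pattern per column.
      s"`${conf.column}` REGEXP ${quote(conf.values.head)}"
    case "enumeration" =>
      // Any listed value makes the record incomplete.
      s"`${conf.column}` IN (${conf.values.map(quote).mkString(", ")})"
    case other =>
      throw new IllegalArgumentException(s"unsupported error type: $other")
  }

  // A record counts as incomplete if any column matches its predicate.
  def whereClause(confs: Seq[ErrorConf]): String =
    confs.map(c => "(" + eachClause(c) + ")").mkString(" OR ")
}
```

With the "company" example from the description plus a made-up regex rule, `CompletenessSketch.whereClause(Seq(ErrorConf("company", "enumeration", Seq("a", "b", "c")), ErrorConf("email", "regex", Seq("^$"))))` yields (`company` IN ('a', 'b', 'c')) OR (`email` REGEXP '^$').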



--
This message was sent by Atlassian Jira
(v8.3.4#803005)