Posted to dev@griffin.apache.org by Senthil Nathan <se...@gmail.com> on 2019/03/11 06:42:07 UTC

Multiple source for accuracy measure

Hi Team,

Is there a way in Griffin to specify more than one data source for the
accuracy measure? I have a scenario where records from two Hive tables are
combined with some business logic and loaded into a target table.
This is what I tried, and it didn't work:

{
  "name": "batch_accu",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src_1",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "dq_check",
            "table.name": "src_tbl_1"
          }
        }
      ]
    },
    {
      "name": "src_2",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "dq_check",
            "table.name": "src_tbl_2"
          }
        }
      ]
    },
    {
      "name": "tgt_1",
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "dq_check",
            "table.name": "tgt_tbl_1"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "src_1.id = src_2.id AND src_2.id = tgt_1.id AND src_1.name = tgt_1.name AND src_2.mgr = tgt_1.mgr",
        "details": {
          "source": [ "src_1","src_tbl_2"],
          "target": "tgt_1",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out": [
          {
            "type": "metric",
            "name": "accu"
          },
          {
            "type": "record",
            "name": "missRecords"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE","HDFS"]
}

Data set:

src_1
id|age|name
1|30|senthil
2|39|mukunth
3|20|guru
4|21|pradeep

src_2
id|mgr
1|durga
3|senthilv

tgt_1
id|name|mgr
1|senthil|durga
3|guru|senthil


I want to build a Griffin-DSL or spark-sql based accuracy measure rule
that performs this check.
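
To make the intended check concrete, here is a minimal sketch that runs
outside Griffin. It is not Griffin configuration and does not use any Griffin
API. It assumes the two sources are combined with a plain inner join on id
(the real business logic may differ), uses Python's built-in sqlite3 in place
of Hive/Spark, and the view name src_joined is made up for illustration. The
combined source is compared with tgt_1 on id, name and mgr, and the
total/miss/matched counts are derived from that comparison.

```python
import sqlite3


def build_sample_db():
    """Load the sample rows from this post into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_1 (id INTEGER, age INTEGER, name TEXT);
        CREATE TABLE src_2 (id INTEGER, mgr TEXT);
        CREATE TABLE tgt_1 (id INTEGER, name TEXT, mgr TEXT);
        -- Hypothetical view: both sources combined (assumed inner join on id).
        CREATE VIEW src_joined AS
            SELECT s1.id AS id, s1.name AS name, s2.mgr AS mgr
            FROM src_1 s1 JOIN src_2 s2 ON s1.id = s2.id;
        """
    )
    conn.executemany(
        "INSERT INTO src_1 VALUES (?, ?, ?)",
        [(1, 30, "senthil"), (2, 39, "mukunth"), (3, 20, "guru"), (4, 21, "pradeep")],
    )
    conn.executemany(
        "INSERT INTO src_2 VALUES (?, ?)",
        [(1, "durga"), (3, "senthilv")],
    )
    conn.executemany(
        "INSERT INTO tgt_1 VALUES (?, ?, ?)",
        [(1, "senthil", "durga"), (3, "guru", "senthil")],
    )
    return conn


def accuracy_counts(conn):
    """Compare the combined source with the target and count misses."""
    total = conn.execute("SELECT COUNT(*) FROM src_joined").fetchone()[0]
    # A combined-source row is a miss when no target row agrees on id, name and mgr.
    miss_records = conn.execute(
        """
        SELECT j.id, j.name, j.mgr
        FROM src_joined j
        LEFT JOIN tgt_1 t
          ON j.id = t.id AND j.name = t.name AND j.mgr = t.mgr
        WHERE t.id IS NULL
        ORDER BY j.id
        """
    ).fetchall()
    miss = len(miss_records)
    return {
        "total_count": total,
        "miss_count": miss,
        "matched_count": total - miss,
        "miss_records": miss_records,
    }


result = accuracy_counts(build_sample_db())
print(result)
```

On the sample data above, the combined source has two rows (id 1 and id 3).
The id 3 row is reported as a miss because src_2 has mgr "senthilv" while
tgt_1 has "senthil".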

I skimmed the code base quickly and it does not look like this is supported,
since both source and target are read as a single String:

def getDQSteps(): Seq[DQStep] = {
    val details = ruleParam.getDetails
    val accuracyExpr = expr.asInstanceOf[LogicalExpr]

    val sourceName = details.getString(_source, context.getDataSourceName(0))
    val targetName = details.getString(_target, context.getDataSourceName(1))
    val analyzer = AccuracyAnalyzer(accuracyExpr, sourceName, targetName)

    val procType = context.procType
    val timestamp = context.contextId.timestamp

    if (!context.runTimeTableRegister.existsTable(sourceName)) {
      warn(s"[${timestamp}] data source ${sourceName} not exists")



If this is supported, could you share an example? That would make it much
easier to get started.

Thanks & Regards,
Senthil Nathan