Posted to dev@griffin.apache.org by Senthil Nathan <se...@gmail.com> on 2019/03/11 06:42:07 UTC
Multiple source for accuracy measure
Hi Team,
Is there a way in Griffin to specify more than one data source for the
accuracy measure? I have a scenario where records from two Hive tables are
combined with some business logic and loaded into a target table.
This is what I tried, and it did not work:
{
  "name": "batch_accu",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src_1",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "dq_check",
            "table.name": "src_tbl_1"
          }
        }
      ]
    },
    {
      "name": "src_2",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "dq_check",
            "table.name": "src_tbl_2"
          }
        }
      ]
    },
    {
      "name": "tgt_1",
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "dq_check",
            "table.name": "tgt_tbl_1"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "src_1.id = src_2.id AND src_2.id = tgt_1.id AND src_1.name = tgt_1.name AND src_2.mgr = tgt_1.mgr",
        "details": {
          "source": ["src_1", "src_tbl_2"],
          "target": "tgt_1",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out": [
          {
            "type": "metric",
            "name": "accu"
          },
          {
            "type": "record",
            "name": "missRecords"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}
Data set:

src_1:
id|age|name
1|30|senthil
2|39|mukunth
3|20|guru
4|21|pradeep

src_2:
id|mgr
1|durga
3|senthilv

tgt_1:
id|name|mgr
1|senthil|durga
3|guru|senthil
I want to build a Griffin-DSL or spark-sql based accuracy measure rule
that checks the target table against both sources combined in this way.
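To make the expected outcome concrete, here is a minimal plain-Python
sketch of the check I have in mind, run on the sample data above. It is not
Griffin code: the helper names combine_sources and accuracy are made up for
illustration, and I am assuming that the two sources are inner-joined on id
and that a combined row counts as matched only when id, name and mgr all
agree with a target row.

```python
# Sample rows copied from the data set in this mail.
src_1 = [
    {"id": 1, "age": 30, "name": "senthil"},
    {"id": 2, "age": 39, "name": "mukunth"},
    {"id": 3, "age": 20, "name": "guru"},
    {"id": 4, "age": 21, "name": "pradeep"},
]
src_2 = [
    {"id": 1, "mgr": "durga"},
    {"id": 3, "mgr": "senthilv"},
]
tgt_1 = [
    {"id": 1, "name": "senthil", "mgr": "durga"},
    {"id": 3, "name": "guru", "mgr": "senthil"},
]


def combine_sources(left, right):
    """Inner-join the two source tables on id (src_1.id = src_2.id)."""
    right_by_id = {row["id"]: row for row in right}
    return [
        {"id": row["id"], "name": row["name"], "mgr": right_by_id[row["id"]]["mgr"]}
        for row in left
        if row["id"] in right_by_id
    ]


def accuracy(source, target, keys=("id", "name", "mgr")):
    """Count source rows that have (or lack) an exact match in target on keys."""
    target_keys = {tuple(row[k] for k in keys) for row in target}
    missed = [row for row in source if tuple(row[k] for k in keys) not in target_keys]
    total = len(source)
    return {
        "total_count": total,
        "miss_count": len(missed),
        "matched_count": total - len(missed),
        "miss_records": missed,
    }


combined = combine_sources(src_1, src_2)
result = accuracy(combined, tgt_1)
print(result)
```

On this data the combined source has two rows (id 1 and id 3). The row with
id 3 is the one I would expect to be reported as missed, because src_2 has
mgr "senthilv" while tgt_1 has "senthil".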
I skimmed the code base quickly, and it does not look like this is
supported, since both the source and the target are read as a single String:
def getDQSteps(): Seq[DQStep] = {
  val details = ruleParam.getDetails
  val accuracyExpr = expr.asInstanceOf[LogicalExpr]
  val sourceName = details.getString(_source, context.getDataSourceName(0))
  val targetName = details.getString(_target, context.getDataSourceName(1))
  val analyzer = AccuracyAnalyzer(accuracyExpr, sourceName, targetName)
  val procType = context.procType
  val timestamp = context.contextId.timestamp
  if (!context.runTimeTableRegister.existsTable(sourceName)) {
    warn(s"[${timestamp}] data source ${sourceName} not exists")
If this is supported, could you share an example? That would make it much
easier.
Thanks & Regards,
Senthil Nathan