Posted to user@hadoop.apache.org by Amit Mittal <am...@gmail.com> on 2014/08/26 19:37:27 UTC

Filter data set by applying many rules

Hi All,

I have a data set stored as text CSV files compressed with gzip. Each
record has around 100 fields. I need to filter the data by applying various
checks, such as:
1. type of the field
2. nullable or not
3. min & max length
4. value belongs to a predefined list
5. value substitution
In total there are around 200 checks for one data set, and there are 5 data
sets like this.

If there were only a few checks, I could have used a simple Pig script with
a filter/UDF, or a MapReduce program. However, hard-coding all of these
checks into a script/UDF/MR program does not seem like a good approach.
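
For example, with only a handful of checks a UDF along these lines would be
enough (just a sketch; the field positions and the allowed values are made
up):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF hard-coding three of the checks for one data set.
public class BasicChecks extends FilterFunc {
    private static final Set<String> COUNTRIES =
            new HashSet<>(Arrays.asList("IN", "US", "UK"));  // made-up list

    @Override
    public Boolean exec(Tuple input) throws IOException {
        String id      = (String) input.get(0);  // check: not null / not empty
        String age     = (String) input.get(1);  // check: numeric, length 1-3
        String country = (String) input.get(2);  // check: value from a fixed list
        if (id == null || id.isEmpty()) return false;
        if (age == null || !age.matches("\\d{1,3}")) return false;
        return COUNTRIES.contains(country);
    }
}

Repeating that pattern ~200 times per data set (and keeping it in sync
across 5 data sets) is what I want to avoid.
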
One way I can think of is to describe all these checks in a JSON file and
then invoke them dynamically, using the reflection API, to filter each
record. However, this may lead to performance issues and does not look like
an optimized solution.
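
To make the idea concrete, the rules could be compiled once per task while
the JSON is read, into plain check objects that are then applied to every
record. Roughly like the sketch below (class and method names are just
examples, JSON parsing itself is omitted, and Java 8 syntax is used only
for brevity):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Sketch only: rules read from the JSON config are compiled once into
// Predicate objects, so nothing is resolved via reflection per record.
public class RuleEngine {

    // One compiled check: which field it applies to and how to test it.
    static class FieldRule {
        final int fieldIndex;
        final Predicate<String> test;
        FieldRule(int fieldIndex, Predicate<String> test) {
            this.fieldIndex = fieldIndex;
            this.test = test;
        }
    }

    private final List<FieldRule> rules = new ArrayList<>();

    // These would be called once (e.g. in Mapper.setup() or a UDF
    // constructor) while walking the parsed JSON config.
    public void addNotNull(int field) {
        rules.add(new FieldRule(field, v -> v != null && !v.isEmpty()));
    }

    public void addLengthRange(int field, int min, int max) {
        rules.add(new FieldRule(field,
                v -> v != null && v.length() >= min && v.length() <= max));
    }

    public void addAllowedValues(int field, String... values) {
        Set<String> allowed = new HashSet<>(Arrays.asList(values));
        rules.add(new FieldRule(field, allowed::contains));
    }

    // Applied to every CSV record, already split into its fields.
    public boolean accept(String[] fields) {
        for (FieldRule r : rules) {
            if (!r.test.test(fields[r.fieldIndex])) {
                return false;
            }
        }
        return true;
    }
}

The same accept() method could then be called from a Pig FilterFunc's
exec() or from a plain Mapper, so the per-record cost is just running the
checks rather than interpreting the JSON each time.
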

Since this looks like a common use case, I would appreciate your opinions
on how best to accomplish it. I can use MR/Pig/Hive for this.

Thanks
Amit