Posted to user@pig.apache.org by Amit Mittal <am...@gmail.com> on 2014/08/27 04:34:35 UTC

Filter data set by applying rules dynamically

Hi All,

I have a data set stored as text CSV files compressed with gzip. Each
record has around 100 fields. I need to filter the data by applying
various checks, such as: "1. field type", "2. nullable?", "3. min & max
length", "4. value belongs to a predefined list", "5. value
substitution". In total there are around 200 checks per data set, and
there are 5 such data sets.

If there were only a few checks, I could use a simple Pig script with a
filter/UDF, or a MapReduce program. However, hard-coding all these
checks in a script/UDF/MR program is not a good approach.

One way I can think of is to encapsulate all these checks in a JSON
file or a Java class, then invoke them dynamically via the reflection
API to filter each record in a UDF. However, this may cause performance
issues and does not seem like an optimal solution.
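
To make the idea concrete, here is a rough sketch of the table-driven
variant I have in mind, using a lookup map of check implementations
instead of per-record reflection (all class and check names below are
illustrative, not an existing library):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Illustrative rule: which field, which check, and the check's argument.
class Rule {
    final int fieldIndex;
    final String checkName; // e.g. "notNull", "maxLength", "inList"
    final String arg;       // check-specific argument as a raw string

    Rule(int fieldIndex, String checkName, String arg) {
        this.fieldIndex = fieldIndex;
        this.checkName = checkName;
        this.arg = arg;
    }
}

class RuleEngine {
    // Built once per JVM; avoids Method.invoke() on every record.
    private static final Map<String, BiPredicate<String, String>> CHECKS =
            new HashMap<>();
    static {
        CHECKS.put("notNull",   (v, a) -> v != null && !v.isEmpty());
        CHECKS.put("minLength", (v, a) -> v != null && v.length() >= Integer.parseInt(a));
        CHECKS.put("maxLength", (v, a) -> v == null || v.length() <= Integer.parseInt(a));
        CHECKS.put("inList",    (v, a) -> Arrays.asList(a.split("\\|")).contains(v));
    }

    private final List<Rule> rules;

    RuleEngine(List<Rule> rules) {
        this.rules = rules;
    }

    // A record passes only if every configured check passes.
    boolean accept(String[] fields) {
        for (Rule r : rules) {
            BiPredicate<String, String> check = CHECKS.get(r.checkName);
            if (check == null || !check.test(fields[r.fieldIndex], r.arg)) {
                return false;
            }
        }
        return true;
    }
}

The ~200 checks per data set would then live in a JSON or properties
file that is parsed once into the List<Rule> at startup, so the
per-record cost is plain map lookups rather than reflective calls.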

Since this looks like a common use case, I would appreciate your
opinions on how best to accomplish it. I can use MR, Pig, or Hive.

Thanks
Amit

Re: Filter data set by applying rules dynamically

Posted by Rodrigo Ferreira <we...@gmail.com>.
Can you be more specific? I didn't understand exactly what you need. I
would say, though, that a customized Pig UDF should do the job.
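
Something like this skeleton, for example (the class name and the two
hard-coded checks are made up just to show the wiring):

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Sketch of a customized FilterFunc: keep a record only if it passes
// the checks wired in below.
public class BasicChecksFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return false;                       // malformed record
        }
        if (input.get(0) == null) {             // e.g. field 0: not null
            return false;
        }
        Object code = input.get(1);             // e.g. field 1: max length 10
        return code == null || code.toString().length() <= 10;
    }
}

You would register and call it along the lines of: REGISTER myudfs.jar;
then good = FILTER raw BY BasicChecksFilter(*); (Pig reads .gz text
input transparently). The hard-coded checks can later be swapped for a
rule table driven by a config file.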

With more info, I can try to give you a better idea of what I mean.

Rodrigo.


2014-08-27 4:34 GMT+02:00 Amit Mittal <am...@gmail.com>:

> Hi All,
>
> I have a data set stored as text CSV files compressed with gzip. Each
> record has around 100 fields. I need to filter the data by applying
> various checks, such as: "1. field type", "2. nullable?", "3. min & max
> length", "4. value belongs to a predefined list", "5. value
> substitution". In total there are around 200 checks per data set, and
> there are 5 such data sets.
>
> If there were only a few checks, I could use a simple Pig script with a
> filter/UDF, or a MapReduce program. However, hard-coding all these
> checks in a script/UDF/MR program is not a good approach.
>
> One way I can think of is to encapsulate all these checks in a JSON
> file or a Java class, then invoke them dynamically via the reflection
> API to filter each record in a UDF. However, this may cause performance
> issues and does not seem like an optimal solution.
>
> Since this looks like a common use case, I would appreciate your
> opinions on how best to accomplish it. I can use MR, Pig, or Hive.
>
> Thanks
> Amit

Fwd: Filter data set by applying rules dynamically

Posted by Divya shree <di...@gmail.com>.
Hi All,

I have a data set stored as text CSV files compressed with gzip. Each
record has around 100 fields. I need to filter the data by applying
various checks, such as: "1. field type", "2. nullable?", "3. min & max
length", "4. value belongs to a predefined list", "5. value
substitution". In total there are around 200 checks per data set, and
there are 5 such data sets.

If there were only a few checks, I could use a simple Pig script with a
filter/UDF, or a MapReduce program. However, hard-coding all these
checks in a script/UDF/MR program is not a good approach.

One way I can think of is to encapsulate all these checks in a JSON
file or a Java class, then invoke them dynamically via the reflection
API to filter each record in a UDF. However, this may cause performance
issues and does not seem like an optimal solution.

Since this looks like a common use case, I would appreciate your
opinions on how best to accomplish it. I can use MR, Pig, or Hive.

Thanks
Divyashree

-- 
Regards
Divyashree