Posted to user@pig.apache.org by jamal sasha <ja...@gmail.com> on 2012/11/14 19:03:17 UTC

removing lines with missing values

Hi,
I have a dataset of the form

f1, f2, ..., fn

Sometimes f1 is empty, sometimes f2, and so on.
Basically, whenever any field is empty, I want to ignore that entry.
One way to do this is a filter such as f1 != '' and so on for every field,
but that would be an ugly statement.
Is there a better way to do this?

Re: removing lines with missing values

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Jamal,

If any fields are empty in an input file, they will be loaded as nulls by
Pig. For example,

,f2,f3,,f5

will be loaded as

(null,f2,f3,null,f5)

When you DUMP it, this is printed as (,f2,f3,,f5).

Now you can use COUNT to count the number of non-null elements in a bag
and keep only the rows where that count equals the total number of
columns, so rows that contain nulls are filtered out. Here is an example:

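-- Load comma-separated rows; empty fields become nulls.
a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
-- Wrap all fields of each row in a bag.
b = FOREACH a GENERATE *, TOBAG(*) AS aBag;
-- COUNT skips nulls, so only complete rows reach 3.
c = FILTER b BY COUNT(aBag) == 3;
-- Drop the helper bag again.
d = FOREACH c GENERATE i,j,k;
DUMP d;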

With the following input file:

1,2,3
,2,3
1,,3

This gives me:

(1,2,3)
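
For comparison, the filter can also be spelled out with explicit null
checks. This is only a sketch, assuming the same 1.txt and the same
three-column schema (i,j,k) as in the example above; the aliases a2 and c2
are made up for illustration. It needs one condition per column, which is
the verbosity the COUNT approach avoids:

```pig
-- Assumed: same input file and schema as the example above.
a2 = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
-- Keep a row only if every field is non-null.
c2 = FILTER a2 BY (i IS NOT NULL) AND (j IS NOT NULL) AND (k IS NOT NULL);
DUMP c2;
```

With the three-line input above, this should also leave only the row that
has no empty fields.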

Alternatively, you can easily write a UDF that takes a tuple or a bag and
checks whether any element is null.
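
Calling such a UDF from Pig Latin might look like the sketch below. The jar
name myudfs.jar and the function myudfs.HasNoNulls are hypothetical; they
stand for a filter function you would write yourself that returns true only
when no field of its input tuple is null:

```pig
-- Hypothetical jar containing your own UDF; it is not part of Pig.
REGISTER myudfs.jar;
a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
-- Pass every field of the row to the hypothetical UDF.
c = FILTER a BY myudfs.HasNoNulls(*);
DUMP c;
```

A UDF written this way would not need the number of columns spelled out in
the script, unlike the COUNT condition.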

Thanks,
Cheolsoo
