Posted to user@pig.apache.org by Dexin Wang <wa...@gmail.com> on 2012/03/02 01:45:28 UTC
filter out null lines returned by UDF
Hi,
I have a UDF that parses a line and returns a bag. Sometimes the line is
bad, so the UDF returns null. In my Pig script, I'd like to filter out
those nulls like this:
raw = LOAD 'raw_input' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two fields in the tuple: id and name
DUMP parsed;
(id1,name1)
(id2,name2)
()
(id3,name3)
parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
DUMP parsed_no_nulls;
(id1,name1)
(id2,name2)
(id3,name3)
This works, but I'm getting this warning:
WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
When I try to use IsEmpty to filter, I get this error "Cannot test a NULL
for emptiness".
What's the correct way to filter out these null bags returned from my UDF?
Thanks.
Dexin
Re: filter out null lines returned by UDF
Posted by Dexin Wang <wa...@gmail.com>.
Yeah, that works great. Thank you, Jonathan.
On Thu, Mar 1, 2012 at 5:14 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> FLATTEN is kind of quirky. If you FLATTEN(null), it will return null, but
> if you FLATTEN a bag that is empty (ie size=0), it will throw away the row.
> I would have your UDF return an empty bag and let the flatten wipe it out.
Re: filter out null lines returned by UDF
Posted by Jonathan Coveney <jc...@gmail.com>.
FLATTEN is kind of quirky. If you FLATTEN(null), it returns null, but
if you FLATTEN an empty bag (i.e., size = 0), it throws away the row.
I would have your UDF return an empty bag and let FLATTEN wipe it out.
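For illustration, here is a rough sketch of how the script side might look, assuming MyUDF is registered as in the original script. Option 1 follows the suggestion above and assumes the UDF has been changed to return an empty bag for bad lines. Option 2 is an alternative that is not from this thread, for the case where the UDF cannot be changed; the relation names `bags` and `non_null` and the alias `b` are hypothetical, and whether it silences the POProject warning in a given Pig version is untested.

```pig
raw = LOAD 'raw_input' AS (line:chararray);

-- Option 1 (suggested above): MyUDF returns an empty bag for bad lines,
-- so FLATTEN drops those rows and no FILTER is needed afterwards.
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));

-- Option 2 (assumption, not from the thread): MyUDF still returns null.
-- Keep the bag unflattened, filter out the nulls, then flatten.
bags = FOREACH raw GENERATE MyUDF(line) AS b;
non_null = FILTER bags BY b IS NOT NULL;
parsed_no_nulls = FOREACH non_null GENERATE FLATTEN(b);
```

With Option 2 the null test is applied to the bag itself rather than to a field of a flattened tuple, so the filter never has to project `id` out of an empty row.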