Posted to user@pig.apache.org by Markus Resch <ma...@adtech.de> on 2012/04/10 17:58:18 UTC

Best Practice: LOAD returns null

Hey everyone,

I have a new question about how best to handle a very common issue:
We have a LOAD statement that loads Avro files using globbing with a given
pattern. For some weird reason this might return null because no file
matches the pattern.
There are two conceivable cases where this can happen:
On purpose: No data was gathered in, e.g., this time frame.
On error: Some nasty guy deleted a very important lookup table for my
join. Great hint about the replicated join, btw :).


Do you have any suggestion about how to handle this?

Thanks

Markus


Re: Best Practice: LOAD returns null

Posted by Bill Graham <bi...@gmail.com>.
I'm not entirely following your empty data set proposal, but regardless I
think you should fail hard and fast if part of the glob is missed. IIRC
Hadoop's filesystem API throws an exception when not all glob variants are
met. I'd recommend throwing that up to the user, which should clearly
indicate why you're failing to execute.
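
One way to surface such a failure to a caller is to check the inputs in a small wrapper before Pig is launched. The following is only a minimal sketch under stated assumptions, not something proposed in this thread: it uses Python's local-filesystem `glob` as a stand-in for an HDFS listing, and the names `resolve_inputs` and `MissingInputError` are hypothetical.

```python
import glob


class MissingInputError(Exception):
    """Raised when a required input pattern matches no files."""


def resolve_inputs(required, optional):
    """Return the patterns that matched at least one file.

    required: patterns that must match (e.g. a lookup table for a join).
    optional: patterns that may legitimately match nothing
              (e.g. a time frame in which no data was gathered).
    """
    for pattern in required:
        if not glob.glob(pattern):
            # Fail hard and fast: a missing required input is an error.
            raise MissingInputError("no files match required input: " + pattern)
    # Optional inputs are simply dropped when they match nothing.
    matched_optional = [p for p in optional if glob.glob(p)]
    return list(required) + matched_optional
```

A wrapper could catch `MissingInputError`, print the message, and exit with a non-zero status, while skipping the Pig run entirely when no optional input matched.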


On Wed, Apr 11, 2012 at 1:13 AM, Markus Resch <ma...@adtech.de> wrote:

> Thanks, you are perfectly right, the LOAD needs to fail. But how do I
> proceed if it fails? AFAIK, I can't return an error to my caller or
> anything like that? One idea I had was to load a default (empty) data set
> with the given schema and union the result of the LOAD with it to get a
> valid empty data set, which I can then handle normally as an empty result.
> But that looks kind of complicated to me.
>
> Thanks
> Markus


-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Best Practice: LOAD returns null

Posted by Markus Resch <ma...@adtech.de>.
Thanks, you are perfectly right, the LOAD needs to fail. But how do I
proceed if it fails? AFAIK, I can't return an error to my caller or
anything like that? One idea I had was to load a default (empty) data set
with the given schema and union the result of the LOAD with it to get a
valid empty data set, which I can then handle normally as an empty result.
But that looks kind of complicated to me.

Thanks
Markus

On Tuesday, 2012-04-10 at 20:53 -0700, Bill Graham wrote:
> Typically, file pattern globbing is very strict, and a LOAD fails if not
> all glob variants are met. This makes sense when you consider that someone
> might pass a glob path covering each of the 24 hours in a day. If one of
> those hours doesn't exist, you want the LOAD to fail.
> 
> thanks,
> Bill

-- 
Markus Resch
Software Developer 
P: +49 6103-5715-236 | F: +49 6103-5715-111 |
ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
www.adtech.com<http://www.adtech.com>

ADTECH | A Division of Advertising.com Group - Residence of the Company:
Dreieich, Germany - Registration Office: Offenbach, HRB 46021
Management Board: Erhard Neumann, Mark Thielen

This message contains privileged and confidential information. Any
dissemination, distribution, copying or other use of this
message or any of its content (data, prices...) to any third parties may
only occur with ADTECH's prior consent.


Re: Best Practice: LOAD returns null

Posted by Bill Graham <bi...@gmail.com>.
Typically, file pattern globbing is very strict, and a LOAD fails if not
all glob variants are met. This makes sense when you consider that someone
might pass a glob path covering each of the 24 hours in a day. If one of
those hours doesn't exist, you want the LOAD to fail.
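
The 24-hour case can be sketched as a pre-flight check that reports which hourly inputs are missing. This is a rough illustration only: the directory layout `<base_dir>/<day>/<HH>/*.avro` and the function name `missing_hours` are assumptions, and Python's local `glob` stands in for an HDFS listing.

```python
import glob
import os


def missing_hours(base_dir, day):
    """Return the hourly glob patterns for `day` that match no files."""
    missing = []
    for hour in range(24):
        # Hypothetical layout: <base_dir>/<day>/<HH>/*.avro
        pattern = os.path.join(base_dir, day, "%02d" % hour, "*.avro")
        if not glob.glob(pattern):
            missing.append(pattern)
    return missing
```

An empty result means every hour is present; a non-empty one lists exactly the hours to report before failing.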

thanks,
Bill


On Tue, Apr 10, 2012 at 8:58 AM, Markus Resch <ma...@adtech.de> wrote:

> Hey everyone,
>
> I have a new question about how best to handle a very common issue:
> We have a LOAD statement that loads Avro files using globbing with a given
> pattern. For some weird reason this might return null because no file
> matches the pattern.
> There are two conceivable cases where this can happen:
> On purpose: No data was gathered in, e.g., this time frame.
> On error: Some nasty guy deleted a very important lookup table for my
> join. Great hint about the replicated join, btw :).
>
>
> Do you have any suggestion about how to handle this?
>
> Thanks
>
> Markus
>
>

