You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Mario Lassnig <ma...@lassnig.net> on 2012/06/07 13:37:36 UTC

Replace null with string

Hello,

I'm having a lot of null entries in my data. Due to later processing it would be 
very helpful if I could set a default value for null to be the string "other". I 
couldn't find a way to do this (version 0.8.1-cdh3u4)

Also, I have some variables in my GENERATE statements that can potentially 
return null, and I would need something similar to the SQL DECODE function to 
get the "other" string instead of null.

Example:

tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) 
AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, 
'^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, 
rnbfiles, rlength, name;

Here: 'site' and 'datatype' could return an empty string (which is valid) and is 
interpreted as null, but should be "other" instead.

Thanks a lot,
Mario

Re: Replace null with string

Posted by Russell Jurney <ru...@gmail.com>.
fixed = FOREACH my_relation GENERATE (my_field IS NOT NULL ? my_field
: 'other') as my_field, *;

Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

On Jun 8, 2012, at 3:01 PM, Dragan Nedeljkovic <dr...@yahoo.com> wrote:

> You can use an UDF like the one bellow to deal with  the NULLs.
>
> register 'mypiggybank.jar';
> define Nvl piggybank.Nvl();
>
> input_lines = LOAD 'test_Nvl.in' AS (line:chararray);
> describe input_lines;
> dump input_lines;
>
> new_list = FOREACH input_lines GENERATE Nvl(line, 'n/a');
> describe new_list;
> dump new_list;
>
> -- eof
>
> package piggybank;
>
> import java.io.IOException;
>
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.Tuple;
>
> public class Nvl
>    extends EvalFunc<String>
> {
>    public String exec(Tuple inputA)
>       throws IOException
>    {
>       try
>       {
>          String inputVal = (String)inputA.get(0);
>          String defaultVal = (String)inputA.get(1);
>
>          if(inputVal == null || inputVal.length() == 0)
>          {
>             return defaultVal;
>          }
>
>          return inputVal;
>       }
>       catch(Exception e)
>       {
>          // Throwing an exception will cause the task to fail.
>          throw new IOException("Something bad happened!", e);
>       }
>    }
> }
> // eof
>
>
>
>
>> ________________________________
>> From: Mario Lassnig <ma...@lassnig.net>
>> To: user@pig.apache.org
>> Sent: Thursday, June 7, 2012 7:37:36 AM
>> Subject: Replace null with string
>>
>> Hello,
>>
>> I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4)
>>
>> Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null.
>>
>> Example:
>>
>> tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name;
>>
>> Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead.
>>
>> Thanks a lot,
>> Mario
>>
>>

Re: Replace null with string

Posted by Dragan Nedeljkovic <dr...@yahoo.com>.
You can use an UDF like the one bellow to deal with  the NULLs.

register 'mypiggybank.jar';
define Nvl piggybank.Nvl();

input_lines = LOAD 'test_Nvl.in' AS (line:chararray);
describe input_lines;
dump input_lines;

new_list = FOREACH input_lines GENERATE Nvl(line, 'n/a');
describe new_list;
dump new_list;

-- eof

package piggybank;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Nvl
   extends EvalFunc<String>
{
   public String exec(Tuple inputA)
      throws IOException
   {
      try
      {
         String inputVal = (String)inputA.get(0);
         String defaultVal = (String)inputA.get(1);
         
         if(inputVal == null || inputVal.length() == 0)
         {
            return defaultVal;
         }

         return inputVal;
      }
      catch(Exception e)
      {
         // Throwing an exception will cause the task to fail.
         throw new IOException("Something bad happened!", e);
      }
   }
}
// eof




>________________________________
> From: Mario Lassnig <ma...@lassnig.net>
>To: user@pig.apache.org 
>Sent: Thursday, June 7, 2012 7:37:36 AM
>Subject: Replace null with string
> 
>Hello,
>
>I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4)
>
>Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null.
>
>Example:
>
>tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name;
>
>Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead.
>
>Thanks a lot,
>Mario
>
>
>

Re: Replace null with string

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Nonulls = foreach somenulls generate
  (field == null ? 'other' : field) as field;

On Jun 7, 2012, at 4:37 AM, Mario Lassnig <ma...@lassnig.net> wrote:

> Hello,
> 
> I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4)
> 
> Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null.
> 
> Example:
> 
> tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name;
> 
> Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead.
> 
> Thanks a lot,
> Mario