Posted to user@pig.apache.org by Parth Sawant <pa...@gmail.com> on 2016/02/18 01:26:45 UTC

Using NOT NULL in a Pig FILTER statement.

I'm trying to filter out some null fields in Pig using 'IS NOT NULL'. For some
reason the null values persist.
For example, when I store the output of the following filter, it still contains
null values for ABC and PQR.

X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS NOT NULL);


Can someone help with this?

Thanks

Parth S
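
A hedged note on the null semantics involved (not part of the original post): in Pig, a comparison such as ABC == 1 evaluates to null when ABC is null, and FILTER keeps only tuples whose condition is true. Under that rule the explicit IS NOT NULL checks should be redundant with the equality checks in the post above. A minimal sketch, assuming a hypothetical comma-delimited file 'sample.txt' with two int columns; the file name and the aliases X_EQ_ONLY and X_WITH_CHK are illustrative only:

```pig
-- Sketch only: 'sample.txt' and the alias names below are hypothetical.
D = LOAD 'sample.txt' USING PigStorage(',') AS (ABC:int, PQR:int);

-- Equality against a null field yields null, which FILTER does not pass through.
X_EQ_ONLY = FILTER D BY (ABC == 1) AND (PQR == 1);

-- The same condition with explicit null checks, as in the post above.
X_WITH_CHK = FILTER D BY (ABC IS NOT NULL) AND (ABC == 1)
                     AND (PQR IS NOT NULL) AND (PQR == 1);

DUMP X_EQ_ONLY;
DUMP X_WITH_CHK;
```

Both relations would be expected to hold the same tuples. If nulls still show up downstream, the load schema or the delimiter may be worth checking before the filter itself.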

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

Great :)


Re: Using NOT NULL in a Pig FILTER statement.

Posted by Parth Sawant <pa...@gmail.com>.

Hi Chandeep,
Thanks for your help. I figured it out too.


Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

Yes, I did filter using the same conditions you’ve mentioned. I tested it earlier with comma as the delimiter (previous email has logs) and now with ^A.

[csingh~]$ cat -v test.txt
1^A2^A76
1^A^A^A76
^A2^A^A76
1^A1^A2^A
1^A1^A1^A76
1^A2^A1^A76

grunt> D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
grunt> DUMP D;
(1,2,76,)
(1,,,76)
(,2,,76)
(1,1,2,)
(1,1,1,76)
(1,2,1,76)

grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);

grunt> DUMP X;
(1,2,1,76)


So the filter for NULLs is working, as you can see when I dump after filtering.
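
As a hedged aside (not part of the original reply), one way to double-check that no nulls survive the filter is to count the tuples in X that still have a null in any column. The sketch below assumes X is the relation produced by the FILTER in the session above; NULL_ROWS, GROUPED, and NULL_COUNT are illustrative alias names.

```pig
-- Sketch only: assumes X was produced by the FILTER shown above.
-- NULL_ROWS, GROUPED, and NULL_COUNT are hypothetical alias names.
NULL_ROWS = FILTER X BY (IS_REPORTED IS NULL) OR (PROCESSING_STATUS_ID IS NULL)
                     OR (PROGRAM_ID IS NULL) OR (AFFINITY_GROUP_ID IS NULL);

-- Collapse the remaining tuples into a single group so they can be counted.
GROUPED = GROUP NULL_ROWS ALL;

-- COUNT_STAR counts every tuple in the bag, including ones with null fields.
NULL_COUNT = FOREACH GROUPED GENERATE COUNT_STAR(NULL_ROWS);
DUMP NULL_COUNT;
```

If NULL_ROWS is empty, GROUP ... ALL produces no group at all, so the DUMP prints no rows rather than a zero.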

> On Feb 19, 2016, at 12:13 AM, Parth Sawant <pa...@gmail.com> wrote:
> 
> Did you put a Filter on the values to remove the null? I'm trying to filter
> the NULL values using the Pig Filter Keyword and then use the Phoenix Pig
> integration to store the data. I have '\\u001' as the delimiter for
> multiple files. It is supported by Pig BulkLoader too.
> 
> Snippet:
> 
> D = LOAD 'src_dest' using PigStorage('\\u001') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> 
> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> 
> On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <cs@chandeep.com> wrote:
> 
>> So, I added one record to your sample to match all the conditions you have
>> in your filter statement.
>> 
>> New input:
>> [csingh]$ hadoop fs -cat test.txt
>> 1,,2,76
>> 1,,,76
>> ,2,,76
>> 1,1,2,
>> 1,1,1,76
>> 1,2,1,76
>> 
>> I modified the load statement to use PigStorage delimited by comma.
>> 
>> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>> 
>> Output:
>> (1,2,1,76)
>> 
>> So, the NOT NULL's seem to be working.
>> 
>> Pig Log’s:
>> 
>> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
>> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>> grunt> DUMP X;
>> 2016-02-18 23:01:06,336 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script: FILTER
>> 2016-02-18 23:01:06,366 [main] INFO
>> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
>> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune,
>> DuplicateForEachColumnRewrite, GroupByConstParallelSetter,
>> ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
>> MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten,
>> PushUpFilter, SplitFilter, StreamTypeCastInserter],
>> RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
>> 2016-02-18 23:01:06,480 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>> - MR plan size before optimization: 1
>> 2016-02-18 23:01:10,798 [JobControl] INFO
>> org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
>> deprecated. Instead, use fs.defaultFS
>> 2016-02-18 23:01:11,345 [JobControl] INFO
>> org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job:
>> job_1454499131434_9884
>> 2016-02-18 23:01:11,542 [JobControl] INFO
>> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted
>> application application_1454499131434_9884
>> 2016-02-18 23:01:11,597 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - 0% complete
>> 2016-02-18 23:01:31,393 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - 50% complete
>> 2016-02-18 23:01:36,818 [main] INFO
>> org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is
>> deprecated. Instead, use mapreduce.job.reduces
>> 2016-02-18 23:01:36,875 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - 100% complete
>> 2016-02-18 23:01:36,878 [main] INFO
>> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>> 
>> HadoopVersion   PigVersion      UserId  StartedAt               FinishedAt              Features
>> 2.6.0-cdh5.4.8  0.12.0-cdh5.4.8 csingh  2016-02-18 23:01:06     2016-02-18 23:01:36     FILTER
>> 
>> Success!
>> 
>> Job Stats (time in seconds):
>> JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
>> job_1454499131434_9884  1       0       8       8       8       8       n/a     n/a     n/a     n/a     D,X     MAP_ONLY
>> 
>> Input(s):
>> Successfully read 6 records (418 bytes) from:
>> 
>> Output(s):
>> Successfully stored 1 records (10 bytes) in:
>> 
>> Counters:
>> Total records written : 1
>> Total bytes written : 10
>> Spillable Memory Manager spill count : 0
>> Total bags proactively spilled: 0
>> Total records proactively spilled: 0
>> 
>> Job DAG:
>> job_1454499131434_9884
>> 
>> 2016-02-18 23:01:36,976 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - Success!
>> 2016-02-18 23:01:36,992 [main] INFO
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
>> to process : 1
>> 2016-02-18 23:01:36,993 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
>> paths to process : 1
>> (1,2,1,76)
>> 
>> 
>> 
>>> On Feb 18, 2016, at 10:13 PM, Parth Sawant <pa...@gmail.com> wrote:
>>> 
>>> Attaching a sample input. Basically 5 rows with only 4 Integer values in
>>> each. Some are NULL values.
>>> 
>>> Thanks.
>>> 
>>> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <cs@chandeep.com> wrote:
>>> I’m just looking for one sample record (which has NULLs) and not the
>>> entire input, so that it's easier for me to debug.
>>> 
>>>> On Feb 18, 2016, at 9:40 PM, Parth Sawant <parth.sawant90@gmail.com> wrote:
>>>> 
>>>> The input is simply too large to relay to others. A simplified schema is
>>>> below. I only have INT columns with some null values in them. This is my
>>>> Pig code snippet:
>>>> 
>>>> D= LOAD 'src_locatn' as
>>>> IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
>>>> AFFINITY_GROUP_ID:INT;
>>>> 
>>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
>>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>>>> 
>>>> Thanks
>>>> 
>>>> On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <cs@chandeep.com> wrote:
>>>> 
>>>>> Any chance you could share a sample record which has NULLs in it, as
>>>>> well as your Pig script?
>>>>> 
>>>>>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <parth.sawant90@gmail.com> wrote:
>>>>>> 
>>>>>> I had anticipated it would throw a similar error with this suggestion as
>>>>>> the last one... and it did. My fields are declared as INT, just to
>>>>>> reiterate. I don't think they can be compared to regexes. Here is the
>>>>>> error:
>>>>>> 
>>>>>> ERROR 1037:
>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>> 
>>>>>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> 
>>>>>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <cs@chandeep.com> wrote:
>>>>>> 
>>>>>>> Since you have integers in this field, can you try matching against a
>>>>>>> regular expression?
>>>>>>> 
>>>>>>> Something like: X matches '\\d+'
>>>>>>> 
>>>>>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <parth.sawant90@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Chandeep. I tried that already but it gave me the following
>> error:
>>>>>>>> 
>>>>>>>> ERROR 1039:
>>>>>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
>>>>>>>> types in NotEqual Operator left hand side:int right hand
>>>>>>>> side:chararray.
>>>>>>>> 
>>>>>>>> The error makes sense cause the fields I have are INT type and
>> hence
>>>>>>>> cannot be compared to a chararray.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks for the prompt response though.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <cs@chandeep.com <ma...@chandeep.com> <mailto:
>> cs@chandeep.com <ma...@chandeep.com>>> wrote:
>>>>>>>> 
>>>>>>>> Try adding != '' along with IS NOT NULL.
>>>>>>>>> 
>>>>>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <
>> parth.sawant90@gmail.com <ma...@gmail.com> <mailto:parth.sawant90@gmail.com <ma...@gmail.com>>
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> I'm trying to Filter some null fields in Pig using 'IS NOT NULL'
>> .
>>>>> For
>>>>>>>>> some
>>>>>>>>>> reason the null data values persist.
>>>>>>>>>> For eg: the following filter on storing it's contents, contains
>> null
>>>>>>>>> values
>>>>>>>>>> for ABC and PQR.
>>>>>>>>>> 
>>>>>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND
>> (PQR
>>>>> IS
>>>>>>>>> NOT
>>>>>>>>>> NULL) ;
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Can someone help with this?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> Parth S
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> <Sample_in.txt>

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Parth Sawant <pa...@gmail.com>.

Did you put a filter on the values to remove the nulls? I'm trying to filter
out the NULL values using the Pig FILTER keyword and then use the Phoenix Pig
integration to store the data. I have '\\u001' as the delimiter for
multiple files. It is supported by Pig BulkLoader too.

Snippet:

D = LOAD 'src_dest' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
(PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
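
One way to confirm whether nulls really survive the filter is to look for them in X before storing it. The following is only a sketch built on the snippet above; NULLS_LEFT and S are hypothetical aliases added for illustration, and the check covers all four fields rather than just the two tested for null in the filter.

```pig
-- Sketch only: D and X are exactly the relations from the snippet above.
D = LOAD 'src_dest' USING PigStorage('\\u001') AS (IS_REPORTED:int,
    PROCESSING_STATUS_ID:int, PROGRAM_ID:int, AFFINITY_GROUP_ID:int);

X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not null)
    AND (IS_REPORTED==1) AND (PROGRAM_ID==1)
    AND (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);

-- Keep only the rows of X in which any field is still null.
NULLS_LEFT = FILTER X BY (IS_REPORTED is null) OR (PROCESSING_STATUS_ID is null)
    OR (PROGRAM_ID is null) OR (AFFINITY_GROUP_ID is null);

-- Look at a handful of them; if the filter behaves, nothing is printed.
S = LIMIT NULLS_LEFT 10;
DUMP S;
```

If this prints nothing, the nulls seen downstream are probably introduced after the filter (for example at the store step) rather than by FILTER itself.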

On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <cs...@chandeep.com> wrote:

> So, I added one record to your sample to match all the conditions you have
> in your filter statement.
>
> New input:
> [csingh]$ hadoop fs -cat test.txt
> 1,,2,76
> 1,,,76
> ,2,,76
> 1,1,2,
> 1,1,1,76
> 1,2,1,76
>
> I modified the load statement to use PigStorage delimited by comma.
>
> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>
> Output:
> (1,2,1,76)
>
> So, the NOT NULL's seem to be working.

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

So, I added one record to your sample to match all the conditions you have in your filter statement.

New input: 
[csingh]$ hadoop fs -cat test.txt
1,,2,76
1,,,76
,2,,76
1,1,2,
1,1,1,76
1,2,1,76

I modified the load statement to use PigStorage delimited by comma.

D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

Output:
(1,2,1,76)

So the IS NOT NULL checks seem to be working.

Pig logs:

grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
grunt> DUMP X;
2016-02-18 23:01:06,336 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2016-02-18 23:01:06,366 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
2016-02-18 23:01:06,480 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-02-18 23:01:10,798 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-18 23:01:11,345 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1454499131434_9884
2016-02-18 23:01:11,542 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1454499131434_9884
2016-02-18 23:01:11,597 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-02-18 23:01:31,393 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-02-18 23:01:36,818 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-02-18 23:01:36,875 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-02-18 23:01:36,878 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.6.0-cdh5.4.8	0.12.0-cdh5.4.8	csingh	2016-02-18 23:01:06	2016-02-18 23:01:36	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1454499131434_9884	1	0	8	8	8	8	n/a	n/a	n/a	n/a	D,X	MAP_ONLY

Input(s):
Successfully read 6 records (418 bytes) from: 

Output(s):
Successfully stored 1 records (10 bytes) in: 

Counters:
Total records written : 1
Total bytes written : 10
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1454499131434_9884

2016-02-18 23:01:36,976 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-02-18 23:01:36,992 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-18 23:01:36,993 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,2,1,76)
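
For what it's worth, Pig's documented null handling says that a comparison such as IS_REPORTED==1 evaluates to null when the field is null, and that FILTER does not pass rows whose condition is null. Under that reading the explicit null tests are redundant but harmless. A sketch against the same test.txt, where X2 is a hypothetical alias:

```pig
D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:int,
    PROCESSING_STATUS_ID:int, PROGRAM_ID:int, AFFINITY_GROUP_ID:int);

-- Same conditions as X, minus the explicit "is not null" tests.
-- A null field makes its comparison null, so the row is not passed through.
X2 = FILTER D BY (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
    (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);

DUMP X2;
```

On the six-row input above this is expected to return the same single row as X.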

> On Feb 18, 2016, at 10:13 PM, Parth Sawant <pa...@gmail.com> wrote:
> 
> Attaching a sample input. Basically 5 rows with only 4 Integer values in each. Some are NULL values.
> 
> Thanks.

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Parth Sawant <pa...@gmail.com>.

Attaching a sample input: basically 5 rows with only 4 integer values in
each, some of which are NULL.

Thanks.

On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <cs...@chandeep.com> wrote:

> I’m just looking for one sample record (which has NULL's) and not the
> entire input so that its easier for me to debug.

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

I’m just looking for one sample record that has NULLs, not the entire input, so that it's easier for me to debug.

> On Feb 18, 2016, at 9:40 PM, Parth Sawant <pa...@gmail.com> wrote:
> 
> The input is simply too large to relay to others. A simplified schema is
> below. I only have INT columns with some null values in them. This is my
> Pig code snippet:
> 
> D= LOAD 'src_locatn' as
> IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
> AFFINITY_GROUP_ID:INT;
> 
> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> 
> Thanks

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Parth Sawant <pa...@gmail.com>.

The input is simply too large to relay to others. A simplified schema is
below. I only have INT columns with some null values in them. This is my
Pig code snippet:

D= LOAD 'src_locatn' as
IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
AFFINITY_GROUP_ID:INT;

X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
(PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);

Thanks
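
One thing that may be worth checking in the snippet above: Pig's documented LOAD syntax encloses a multi-field schema in parentheses, and when no USING clause is given the default loader is PigStorage with a tab delimiter. If the files are not tab-delimited, the fields may not split as intended. A sketch, assuming the Ctrl-A delimiter ('\\u001') mentioned elsewhere in this thread; S is a hypothetical alias:

```pig
-- Assumption: the input is Ctrl-A delimited; adjust the delimiter if it is not.
D = LOAD 'src_locatn' USING PigStorage('\\u001') AS (IS_REPORTED:int,
    PROCESSING_STATUS_ID:int, PROGRAM_ID:int, AFFINITY_GROUP_ID:int);

-- Print the schema, then a few rows, to confirm the columns split correctly.
DESCRIBE D;
S = LIMIT D 5;
DUMP S;
```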

On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <cs...@chandeep.com> wrote:

> Any chance you could share a sample record which has NULL’s in it? as well
> as your pig script?

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

Any chance you could share a sample record that has NULLs in it, as well as your Pig script?

> On Feb 18, 2016, at 8:36 PM, Parth Sawant <pa...@gmail.com> wrote:
> 
> I had anticipated it would throw a similar error with this suggestion as
> the last one... and it did. My fields are declared as INT, just to
> re-iterate. I don't think they can be compared to regexes. Here is the
> error:
> 
> ERROR 1037:
> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> CharArray only :(Name: Regex Type: null Uid: null)
> 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> CharArray only :(Name: Regex Type: null Uid: null)
> 
> 
> 
> Thanks.

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Parth Sawant <pa...@gmail.com>.

I had anticipated it would throw a similar error with this suggestion as
the last one... and it did. Just to reiterate, my fields are declared as
INT, and I don't think they can be compared to regexes. Here is the
error:

ERROR 1037:
<file LeadSales.pig, line 19, column 29> Operands of Regex can be
CharArray only :(Name: Regex Type: null Uid: null)

org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
<file LeadSales.pig, line 19, column 29> Operands of Regex can be
CharArray only :(Name: Regex Type: null Uid: null)



Thanks.
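
ERROR 1037 says the operands of the regex operator must be chararray. If a pattern test on an INT field is still wanted, one option is an explicit cast to chararray. This is only a sketch: it borrows the comma-delimited test.txt and field names used elsewhere in this thread, and Y is a hypothetical alias.

```pig
D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:int,
    PROCESSING_STATUS_ID:int, PROGRAM_ID:int, AFFINITY_GROUP_ID:int);

-- Cast the int field to chararray so that "matches" accepts it.
-- A null field stays null after the cast, so the row is not passed through.
Y = FILTER D BY ((chararray)IS_REPORTED matches '\\d+');

DUMP Y;
```

Note that '\\d+' does not match negative numbers, and that for a field already declared INT this test adds little beyond a plain null check.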


On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <cs...@chandeep.com> wrote:

> Since you integers in this field can you try matching to a regular
> expression?
>
> Something like: X matches '\\d+'

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

Since you have integers in this field, can you try matching against a regular expression?

Something like: X matches '\\d+'

> On Feb 18, 2016, at 12:55 AM, Parth Sawant <pa...@gmail.com> wrote:
> 
> Hi Chandeep. I tried that already but it gave me the following error:
> 
> ERROR 1039:
> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
> types in NotEqual Operator left hand side:int right hand
> side:chararray.
> 
> The error makes sense because the fields I have are INT type and hence
> cannot be compared to a chararray.
> 
> 
> Thanks for the prompt response though.
> 
> 
> 
> On Feb 17, 2016 16:32, "Chandeep Singh" <cs...@chandeep.com> wrote:
> 
> Try adding != '' along with IS NOT NULL.
>> 
>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <pa...@gmail.com>
>> wrote:
>>> 
>>> I'm trying to filter out some null fields in Pig using 'IS NOT NULL'. For
>>> some reason the null data values persist.
>>> For example, the stored output of the following filter still contains null
>>> values for ABC and PQR.
>>>
>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS NOT NULL);
>>> 
>>> 
>>> Can someone help with this?
>>> 
>>> Thanks
>>> 
>>> Parth S
>> 
>>

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Parth Sawant <pa...@gmail.com>.

Hi Chandeep. I tried that already but it gave me the following error:

ERROR 1039:
<file LeadSales.pig, line 19, column 27> In alias X, incompatible
types in NotEqual Operator left hand side:int right hand
side:chararray.

The error makes sense because the fields I have are INT type and hence
cannot be compared to a chararray.
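
A minimal sketch of a type-consistent version of the filter, assuming ABC and
PQR are loaded as INT. The file name, delimiter, and schema are assumptions
for illustration only.

```pig
-- Assumed input: a comma-delimited file with two INT columns (hypothetical).
D = LOAD 'test.txt' USING PigStorage(',') AS (ABC:int, PQR:int);

-- Compare INT fields against int literals only. Comparing an INT field to a
-- chararray literal such as '' is what triggers ERROR 1039.
X = FILTER D BY (ABC IS NOT NULL) AND (PQR IS NOT NULL)
            AND (ABC == 1) AND (PQR == 1);

DUMP X;
```

As I understand Pig's null handling, a comparison such as ABC == 1 evaluates to
null when ABC is null, and FILTER does not keep rows whose condition is null,
so the explicit IS NOT NULL checks are mostly there for readability.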

Thanks for the prompt response though.

On Feb 17, 2016 16:32, "Chandeep Singh" <cs...@chandeep.com> wrote:

Try adding != '' along with IS NOT NULL.
>
> > On Feb 18, 2016, at 12:26 AM, Parth Sawant <pa...@gmail.com>
> wrote:
> >
> > I'm trying to filter out some null fields in Pig using 'IS NOT NULL'. For
> > some reason the null data values persist.
> > For example, the stored output of the following filter still contains null
> > values for ABC and PQR.
> >
> > X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS NOT NULL);
> >
> >
> > Can someone help with this?
> >
> > Thanks
> >
> > Parth S
>
>

Re: Using NOT NULL in a Pig FILTER statement.

Posted by Chandeep Singh <cs...@chandeep.com>.

Try adding != '' along with IS NOT NULL. 

> On Feb 18, 2016, at 12:26 AM, Parth Sawant <pa...@gmail.com> wrote:
> 
> I'm trying to filter out some null fields in Pig using 'IS NOT NULL'. For some
> reason the null data values persist.
> For example, the stored output of the following filter still contains null
> values for ABC and PQR.
> 
> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS NOT NULL);
> 
> 
> Can someone help with this?
> 
> Thanks
> 
> Parth S