Posted to user@pig.apache.org by Steve Watt <wa...@gmail.com> on 2011/04/22 23:25:31 UTC

Pig FILTER with INDEXOF not working

Hi Folks

I've done a load of a dataset and I am attempting to filter out unwanted
records by checking that one of my tuple fields contains a particular
string. I've distilled this issue down to the sample excite.log that ships
with Pig for easy recreation. I've read through the INDEXOF code and I think
this should work (lots of queries that contain the word yahoo) but my
queries dump always contains zero records. Can anyone tell me what I am
doing wrong?

raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
query);
queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
dump queries;

Regards
Steve Watt

Re: Pig FILTER with INDEXOF not working

Posted by Aniket Mokashi <am...@andrew.cmu.edu>.
I think the fix is to change
tuple.set(0, new DataByteArray(url));
to
tuple.set(0, url);
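So getNext() would end up looking something like this (an untested sketch based on the code you posted, with all three fields set as plain Strings):

@Override
public Tuple getNext() throws IOException {
    try {
        if (!reader.nextKeyValue()) {
            return null;
        }
        Content value = (Content) reader.getCurrentValue();
        Tuple tuple = TupleFactory.getInstance().newTuple(3);
        // Plain Strings match the chararray fields declared in the AS clause,
        // so no bytearray-to-chararray conversion is needed.
        tuple.set(0, value.getUrl());
        tuple.set(1, value.getContentType());
        tuple.set(2, value.getContent().toString());
        return tuple;
    } catch (InterruptedException e) {
        throw new ExecException(e);
    }
}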

Thanks,
Aniket

On Fri, April 22, 2011 8:30 pm, Steve Watt wrote:
> Richard, if you're coming to OSCON or Hadoop Summit, please let me know
> so I can buy you a beer. Thanks for the help. This now works with the
> excite log using PigStorage();
>
> It is however still not working with my custom LoadFunc and data. For
> reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
>  Nutch Segments that reads in each page that is crawled and represents it
> as a Tuple of (Url, ContentType, PageContent) as shown in the script
> below:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>  using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray,
> content:chararray);
> companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0); dump companies;
>
> This keeps failing with ERROR 1071: Cannot convert a
> generic_writablecomparable to a String. However, if I change the script to
> the following (remove the schema types and dump straight after the load),
> it works:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>  using com.hp.demo.SegmentLoader() AS (url, type, content); dump webcrawl;
>
>
> Clearly, as soon as I inject types into the Load Schema it starts
> bombing. Can anyone tell me what I am doing wrong? I have attached my
> Nutch LoadFunc
> below for reference:
>
> public class SegmentLoader extends FileInputLoadFunc {
>
>     private SequenceFileRecordReader<WritableComparable, Content> reader;
>     protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);
>
>     @Override
>     public void setLocation(String location, Job job) throws IOException {
>         FileInputFormat.setInputPaths(job, location);
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public InputFormat getInputFormat() throws IOException {
>         return new SequenceFileInputFormat<WritableComparable, Content>();
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
>         this.reader = (SequenceFileRecordReader) reader;
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         try {
>             if (!reader.nextKeyValue()) {
>                 return null;
>             }
>             Content value = (Content) reader.getCurrentValue();
>             String url = value.getUrl();
>             String type = value.getContentType();
>             String content = value.getContent().toString();
>             Tuple tuple = TupleFactory.getInstance().newTuple(3);
>             tuple.set(0, new DataByteArray(url));
>             tuple.set(1, new DataByteArray(type));
>             tuple.set(2, new DataByteArray(content));
>             return tuple;
>         } catch (InterruptedException e) {
>             throw new ExecException(e);
>         }
>     }
> }
>
>
> On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <rd...@yahoo-inc.com>
> wrote:
>
>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
>>  query:chararray);
>>
>>
>> queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0); dump queries;
>>
>>
>> On 4/22/11 2:25 PM, "Steve Watt" <wa...@gmail.com> wrote:
>>
>>
>> Hi Folks
>>
>>
>> I've done a load of a dataset and I am attempting to filter out
>> unwanted records by checking that one of my tuple fields contains a
>> particular string. I've distilled this issue down to the sample
>> excite.log that ships with Pig for easy recreation. I've read through
>> the INDEXOF code and I think this should work (lots of queries that
>> contain the word yahoo) but my queries dump always contains zero
>> records. Can anyone tell me what I am doing wrong?
>>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
>>  query); queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0); dump
>> queries;
>>
>> Regards
>> Steve Watt
>>
>>
>>
>



Re: Pig FILTER with INDEXOF not working

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
If the expected return type of your loader is (String, String, String), you
should just put Strings into the tuple (no conversion to DataByteArrays) and
report your schema to Pig via an implementation of LoadMetadata.getSchema().
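Something along these lines should be close (a rough, untested sketch against
the Pig 0.8 API; the field names are copied from your AS clause):

// Assumed imports: org.apache.pig.LoadMetadata, org.apache.pig.ResourceSchema,
// org.apache.pig.ResourceStatistics, org.apache.pig.Expression,
// org.apache.pig.data.DataType, org.apache.pig.impl.logicalLayer.schema.Schema
public class SegmentLoader extends FileInputLoadFunc implements LoadMetadata {

    @Override
    public ResourceSchema getSchema(String location, Job job) throws IOException {
        // Tell Pig the loader already produces three chararray fields.
        Schema schema = new Schema();
        schema.add(new Schema.FieldSchema("url", DataType.CHARARRAY));
        schema.add(new Schema.FieldSchema("type", DataType.CHARARRAY));
        schema.add(new Schema.FieldSchema("content", DataType.CHARARRAY));
        return new ResourceSchema(schema);
    }

    // The remaining LoadMetadata methods can be no-ops for this loader.
    @Override
    public ResourceStatistics getStatistics(String location, Job job) throws IOException {
        return null;
    }

    @Override
    public String[] getPartitionKeys(String location, Job job) throws IOException {
        return null;
    }

    @Override
    public void setPartitionFilter(Expression partitionFilter) throws IOException {
    }

    // ... plus the existing setLocation / getInputFormat / prepareToRead / getNext ...
}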

D

On Fri, Apr 22, 2011 at 5:30 PM, Steve Watt <wa...@gmail.com> wrote:

> Richard, if you're coming to OSCON or Hadoop Summit, please let me know so
> I can buy you a beer. Thanks for the help. This now works with the excite
> log using PigStorage();
>
> It is however still not working with my custom LoadFunc and data. For
> reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
> Nutch Segments that reads in each page that is crawled and represents it as
> a Tuple of (Url, ContentType, PageContent) as shown in the script below:
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
> using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray,
> content:chararray);
> companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0);
> dump companies;
>
> This keeps failing with ERROR 1071: Cannot convert a
> generic_writablecomparable to a String. However, if I change the script to
> the following (remove the schema types and dump straight after the load), it works:
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
> using com.hp.demo.SegmentLoader() AS (url, type, content);
> dump webcrawl;
>
> Clearly, as soon as I inject types into the Load Schema it starts bombing.
> Can anyone tell me what I am doing wrong? I have attached my Nutch LoadFunc
> below for reference:
>
> public class SegmentLoader extends FileInputLoadFunc {
>
>     private SequenceFileRecordReader<WritableComparable, Content> reader;
>     protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);
>
>     @Override
>     public void setLocation(String location, Job job) throws IOException {
>         FileInputFormat.setInputPaths(job, location);
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public InputFormat getInputFormat() throws IOException {
>         return new SequenceFileInputFormat<WritableComparable, Content>();
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
>         this.reader = (SequenceFileRecordReader) reader;
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         try {
>             if (!reader.nextKeyValue()) {
>                 return null;
>             }
>             Content value = (Content) reader.getCurrentValue();
>             String url = value.getUrl();
>             String type = value.getContentType();
>             String content = value.getContent().toString();
>             Tuple tuple = TupleFactory.getInstance().newTuple(3);
>             tuple.set(0, new DataByteArray(url));
>             tuple.set(1, new DataByteArray(type));
>             tuple.set(2, new DataByteArray(content));
>             return tuple;
>         } catch (InterruptedException e) {
>             throw new ExecException(e);
>         }
>     }
> }
>
> On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <rd...@yahoo-inc.com> wrote:
>
> >  raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> > query:chararray);
> >
> > queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
> > dump queries;
> >
> >
> > On 4/22/11 2:25 PM, "Steve Watt" <wa...@gmail.com> wrote:
> >
> > Hi Folks
> >
> > I've done a load of a dataset and I am attempting to filter out unwanted
> > records by checking that one of my tuple fields contains a particular
> > string. I've distilled this issue down to the sample excite.log that
> ships
> > with Pig for easy recreation. I've read through the INDEXOF code and I
> > think
> > this should work (lots of queries that contain the word yahoo) but my
> > queries dump always contains zero records. Can anyone tell me what I am
> > doing wrong?
> >
> > raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> > query);
> > queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
> > dump queries;
> >
> > Regards
> > Steve Watt
> >
> >
>

Re: Pig FILTER with INDEXOF not working

Posted by Steve Watt <wa...@gmail.com>.
Richard, if you're coming to OSCON or Hadoop Summit, please let me know so I
can buy you a beer. Thanks for the help. This now works with the excite
log using PigStorage();

It is however still not working with my custom LoadFunc and data. For
reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
Nutch Segments that reads in each page that is crawled and represents it as
a Tuple of (Url, ContentType, PageContent) as shown in the script below:

webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray,
content:chararray);
companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0);
dump companies;

This keeps failing with ERROR 1071: Cannot convert a
generic_writablecomparable to a String. However, if I change the script to
the following (remove the schema types and dump straight after the load), it works:

webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
using com.hp.demo.SegmentLoader() AS (url, type, content);
dump webcrawl;

Clearly, as soon as I inject types into the Load Schema it starts bombing.
Can anyone tell me what I am doing wrong? I have attached my Nutch LoadFunc
below for reference:

public class SegmentLoader extends FileInputLoadFunc {

    private SequenceFileRecordReader<WritableComparable, Content> reader;
    protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @SuppressWarnings("unchecked")
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new SequenceFileInputFormat<WritableComparable, Content>();
    }

    @SuppressWarnings("unchecked")
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = (SequenceFileRecordReader) reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;
            }
            Content value = (Content) reader.getCurrentValue();
            String url = value.getUrl();
            String type = value.getContentType();
            String content = value.getContent().toString();
            Tuple tuple = TupleFactory.getInstance().newTuple(3);
            tuple.set(0, new DataByteArray(url));
            tuple.set(1, new DataByteArray(type));
            tuple.set(2, new DataByteArray(content));
            return tuple;
        } catch (InterruptedException e) {
            throw new ExecException(e);
        }
    }
}

On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <rd...@yahoo-inc.com> wrote:

>  raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> query:chararray);
>
> queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
> dump queries;
>
>
> On 4/22/11 2:25 PM, "Steve Watt" <wa...@gmail.com> wrote:
>
> Hi Folks
>
> I've done a load of a dataset and I am attempting to filter out unwanted
> records by checking that one of my tuple fields contains a particular
> string. I've distilled this issue down to the sample excite.log that ships
> with Pig for easy recreation. I've read through the INDEXOF code and I
> think
> this should work (lots of queries that contain the word yahoo) but my
> queries dump always contains zero records. Can anyone tell me what I am
> doing wrong?
>
> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> query);
> queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
> dump queries;
>
> Regards
> Steve Watt
>
>

Re: Pig FILTER with INDEXOF not working

Posted by Richard Ding <rd...@yahoo-inc.com>.
raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time, query:chararray);
queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
dump queries;
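
INDEXOF is zero-based, so a query that begins with 'yahoo' matches at index 0
and the original '> 0' test drops it. The :chararray declaration also matters:
without it the field comes through as a bytearray, and INDEXOF then just
returns null (I believe), so nothing passes the filter either way. A quick way
to see the positions, using the typed raw above:

positions = FOREACH raw GENERATE query, INDEXOF(query, 'yahoo') AS pos;
dump positions;  -- queries that start with 'yahoo' show pos = 0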


On 4/22/11 2:25 PM, "Steve Watt" <wa...@gmail.com> wrote:

Hi Folks

I've done a load of a dataset and I am attempting to filter out unwanted
records by checking that one of my tuple fields contains a particular
string. I've distilled this issue down to the sample excite.log that ships
with Pig for easy recreation. I've read through the INDEXOF code and I think
this should work (lots of queries that contain the word yahoo) but my
queries dump always contains zero records. Can anyone tell me what I am
doing wrong?

raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
query);
queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
dump queries;

Regards
Steve Watt