Posted to user@pig.apache.org by Ryan <fr...@gmail.com> on 2014/12/05 18:09:34 UTC

Help with Pig UDF?

Hi,

I'm working on an open source project attempting to convert raw content
from a pdf (stored as a databytearray) into plain text using a Pig UDF and
Apache Tika. I could use your help. For some reason, the UDF I'm using
isn't working. The script succeeds but no output is written. *This is the
Pig script I'm following:*

register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
DEFINE ExtractTextFromPDFs
 org.warcbase.pig.piggybank.ExtractTextFromPDFs();
DEFINE ArcLoader org.warcbase.pig.ArcLoader();

raw = load '/data/arc/sample.arc' using ArcLoader
    as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data

a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the arc file
b = LIMIT a 2; --limit to 2 pages to speed up testing time
c = foreach b generate url, ExtractTextFromPDFs(content);
store c into 'output/pdf_test';


*This is the UDF I wrote:*

public class ExtractTextFromPDFs extends EvalFunc<String> {

  @Override
  public String exec(Tuple input) throws IOException {
      String pdfText = "";

      if (input == null || input.size() == 0 || input.get(0) == null) {
          return "N/A";
      }

      DataByteArray dba = (DataByteArray)input.get(0);
      pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written

      InputStream is = new ByteArrayInputStream(dba.get());

      ContentHandler contenthandler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      DefaultDetector detector = new DefaultDetector();
      AutoDetectParser pdfparser = new AutoDetectParser(detector);

      try {
        pdfparser.parse(is, contenthandler, metadata, new ParseContext());
      } catch (SAXException | TikaException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
      }
      pdfText.concat(" : "); //another attempt at debugging. Still nothing written
      pdfText.concat(contenthandler.toString());

      //close the input stream
      if(is != null){
        is.close();
      }
      return pdfText;
  }

}

Thank you for your assistance,
Ryan

Re: Help with Pig UDF?

Posted by Ryan <fr...@gmail.com>.

Got it, thanks! Any idea why Tika might not be working? I've been testing,
and while no exceptions are being thrown, nothing is being appended either
when I call pdfText.append(contenthandler.toString());

On Fri, Dec 5, 2014 at 6:21 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> A static variable is not necessary... a simple instance variable is just
> fine.

Re: Help with Pig UDF?

Posted by Pradeep Gollakota <pr...@gmail.com>.

A static variable is not necessary... a simple instance variable is just
fine.

On Fri Dec 05 2014 at 2:27:53 PM Ryan <fr...@gmail.com> wrote:

> After running it with updated code, it seems like the problem has to do
> with something related to Tika since my output says that my input is the
> correct number of bytes (i.e. it's actually being sent in correctly). Going
> to test further to narrow down the problem.
>
> Pradeep, would you recommend using a static variable inside the
> ExtractTextFromPDFs function to store the PdfParser once it has been
> initialized once? I'm still learning how to best do things within the
> Pig/MapReduce/Hadoop framework
>
> Ryan

Re: Help with Pig UDF?

Posted by Ryan <fr...@gmail.com>.

After running it with updated code, it seems like the problem has to do
with something related to Tika since my output says that my input is the
correct number of bytes (i.e. it's actually being sent in correctly). Going
to test further to narrow down the problem.

Pradeep, would you recommend using a static variable inside the
ExtractTextFromPDFs function to store the PdfParser after it has been
initialized once? I'm still learning how to best do things within the
Pig/MapReduce/Hadoop framework.

Ryan

On Fri, Dec 5, 2014 at 1:35 PM, Ryan <fr...@gmail.com> wrote:

> Thanks Pradeep! I'll give it a try and report back
>
> Ryan

Re: Help with Pig UDF?

Posted by Ryan <fr...@gmail.com>.

Thanks Pradeep! I'll give it a try and report back

Ryan

On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> I forgot to mention earlier that you should probably move the PdfParser
> initialization code out of the evaluate method. This will probably cause a
> significant overhead both in terms of gc and runtime performance. You'll
> want to initialize your parser once and evaluate all your docs against it.
>
> - Pradeep

Re: Help with Pig UDF?

Posted by Pradeep Gollakota <pr...@gmail.com>.

I forgot to mention earlier that you should probably move the PdfParser
initialization code out of the exec method. Constructing the parser on every
call will probably cause significant overhead, both in terms of GC and runtime
performance. You'll want to initialize your parser once and evaluate all your
docs against it.

- Pradeep
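
The "initialize once" advice above can be sketched without Pig or Tika on the
classpath. Everything named here (CostlyParser, ExtractTextSketch,
InitOnceDemo) is a hypothetical stand-in, not part of either library:
CostlyParser plays the role of the parser, and ExtractTextSketch plays the
role of a UDF class that would extend EvalFunc<String> in a real job. A
minimal sketch, assuming the parser object can safely be reused across calls:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a parser that is expensive to construct.
class CostlyParser {
    // Counts constructions, for demonstration only.
    static int constructed = 0;

    CostlyParser() {
        constructed++;
    }

    // Pretend "parsing": decode the bytes as UTF-8 text.
    String parse(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for the UDF class.
class ExtractTextSketch {
    // Instance field: built once per object, reused by every exec() call.
    private final CostlyParser parser = new CostlyParser();

    String exec(byte[] content) {
        if (content == null) {
            return "N/A";
        }
        // Per-call state stays local; only the parser is shared.
        StringBuilder out = new StringBuilder();
        out.append(content.length);
        out.append(" : ");
        out.append(parser.parse(content));
        return out.toString();
    }
}

class InitOnceDemo {
    public static void main(String[] args) {
        ExtractTextSketch udf = new ExtractTextSketch();
        System.out.println(udf.exec(new byte[] {97, 98, 99})); // prints "3 : abc"
        System.out.println(udf.exec(new byte[] {100, 101}));   // prints "2 : de"
        System.out.println(CostlyParser.constructed);          // prints "1"
    }
}
```

Only the parser is hoisted into the field. Anything that accumulates
per-document output (in the original UDF, presumably the content handler and
metadata objects) would still be created inside the method for each input.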

On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <pr...@gmail.com>
wrote:

> Java strings are immutable. So "pdfText.concat()" returns a new string
> and the original string is left unmolested. So at the end, all you're doing
> is returning an empty string. Instead, you can do "pdfText =
> pdfText.concat(...)". But the better way to write it is to use a
> StringBuilder.
>
> StringBuilder pdfText = ...;
> pdfText.append(...);
> pdfText.append(...);
> ...
> return pdfText.toString();

Re: Help with Pig UDF?

Posted by Pradeep Gollakota <pr...@gmail.com>.

Java strings are immutable, so "pdfText.concat()" returns a new string and
leaves the original string unchanged. In the end, all you're doing is
returning an empty string. Instead, you can do "pdfText =
pdfText.concat(...)". But the better way to write it is to use a
StringBuilder.

StringBuilder pdfText = ...;
pdfText.append(...);
pdfText.append(...);
...
return pdfText.toString();
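
To make the difference concrete, here is a small standalone comparison. The
class and method names (ConcatDemo, discarded, built) are made up for this
illustration; only String.concat and StringBuilder.append are standard Java.
The first method repeats the pattern from the original UDF, the second uses
the StringBuilder approach:

```java
class ConcatDemo {
    // Same pattern as the original UDF: every concat() result is thrown away.
    static String discarded(int size, String body) {
        String pdfText = "";
        pdfText.concat(String.valueOf(size));
        pdfText.concat(" : ");
        pdfText.concat(body);
        return pdfText; // still the empty string
    }

    // StringBuilder is mutable, so each append() changes the same object.
    static String built(int size, String body) {
        StringBuilder pdfText = new StringBuilder();
        pdfText.append(size);
        pdfText.append(" : ");
        pdfText.append(body);
        return pdfText.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discarded(42, "hello") + "]"); // prints "[]"
        System.out.println("[" + built(42, "hello") + "]");     // prints "[42 : hello]"
    }
}
```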
