Posted to user@tika.apache.org by "Ensor, Neal" <En...@osti.gov> on 2021/09/07 18:22:11 UTC

Tika 2.1 custom ContentHandler question

I have been using Tika for quite some time to handle document text extraction for SOLR indexing, but while updating to 2.1.0 I have run into an issue:

I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I have previously relied on to trigger page-based calls around <div class="page"> events on startElement.  These do not appear to be generated for me now.  The same code with 1.27 had no issues here.  In fact, I am also no longer "seeing" any "<a>" tag events in my handler.  Is there some alternative way to access the content handler that I am not employing?

When using the Tika 2.1.0 "app" via the command line with --xml, the resulting document shows all the tags I expect, but my ContentHandler does not receive them.  Any advice would be appreciated, and if more info or code snippets from me might help, I'd be happy to provide them.

Thanks in advance!

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
ensorn@osti.gov
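
For readers following along, the kind of handler described above can be sketched without Tika at all. This is only a minimal sketch, not the poster's actual code: the class name PageCountingHandler and the sample XHTML string are made up for illustration, and the JDK's own SAX parser stands in for AutoDetectParser so that the example runs on its own. With Tika, the same handler instance would presumably be passed to parser.parse(...) instead of the stand-in driver.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <div class="page"> starts and <a> elements.
class PageCountingHandler extends DefaultHandler {
    private int pages;
    private int links;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages++;   // a page boundary, as in the XHTML that --xml prints
        } else if ("a".equals(name)) {
            links++;
        }
    }

    int getPages() { return pages; }
    int getLinks() { return links; }

    // Stand-in for Tika: feeds a literal XHTML string through the JDK SAX parser.
    static PageCountingHandler run(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PageCountingHandler handler = new PageCountingHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample shaped like paginated XHTML output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>one <a href=\"#x\">link</a></p></div>"
                + "<div class=\"page\"><p>two</p></div>"
                + "</body></html>";
        PageCountingHandler h = run(sample);
        System.out.println(h.getPages() + " pages, " + h.getLinks() + " links");
    }
}
```

If a handler like this reports zero pages for a document that clearly has pages, the handler logic is probably fine and the events are simply never being emitted upstream.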

Re: Tika 2.1 custom ContentHandler question

Posted by Tim Allison <ta...@apache.org>.

No need to apologize at all.  There are some major differences in 2.x.
Many thanks for migrating and sharing your pain!

Please let us know what else you find.

Best,

     Tim

On Tue, Sep 7, 2021 at 4:01 PM Ensor, Neal <En...@osti.gov> wrote:

> Well, I must apologize for taking this to the list:  it seems this is
> entirely my fault.
>
> I had not properly followed the "Migrating to Tika 2.0.0" guide:
> https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0
>
> Specifically, it seems I should be importing
> "tika-parsers-standard-package" rather than simply "tika-parsers".  After
> fixing that dependency my tests ran fine.  So, if anyone can learn from my
> mistake, hopefully that will have been at least worth it!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> ensorn@osti.gov
> ------------------------------
> *From:* Ensor, Neal <En...@osti.gov>
> *Sent:* Tuesday, September 7, 2021 3:18 PM
> *To:* user@tika.apache.org <us...@tika.apache.org>; tallison@apache.org <
> tallison@apache.org>
> *Subject:* Re: Tika 2.1 custom ContentHandler question
>
> Also, and probably more importantly, NONE of the text is being captured by
> my handler.  The "characters" method of SAX isn't being called at all and
> everything is coming back blank, so something is clearly not hooked up.
>
> Using the downloaded "app":
>
> java -jar tika-app-2.1.0.jar --xml myfile.pdf
>
> results in what I would expect, XML tags and all content intact.
>
> The maven dependencies of my project:
>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-core</artifactId>
>   <version>2.1.0</version>
> </dependency>
>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>2.1.0</version>
>   <type>pom</type>
> </dependency>
>
> If that makes any difference.
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> ensorn@osti.gov
> ------------------------------
> *From:* Ensor, Neal <En...@osti.gov>
> *Sent:* Tuesday, September 7, 2021 3:03 PM
> *To:* user@tika.apache.org <us...@tika.apache.org>; tallison@apache.org <
> tallison@apache.org>
> *Subject:* Re: Tika 2.1 custom ContentHandler question
>
> Presently, I'm trying a mix of mostly PDF and OpenOffice documents that
> previously parsed successfully.  The relevant code is mainly something like this:
>
> AutoDetectParser parser = new AutoDetectParser();
> Metadata md = new Metadata();
> CustomContentHandler handler = new CustomContentHandler();
>
> parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);
>
> In the CustomContentHandler (extending DefaultHandler from SAX) I
> implement the startElement and endElement events, and log each such item.
> Previously, I would get a number of "a", "div", and other HTML tag events.
>  Since 2.1.0, I only get "html", "head", a few "meta" tags, and a "title",
> and "body".
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> ensorn@osti.gov
> ------------------------------
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Tuesday, September 7, 2021 2:57 PM
> *To:* user@tika.apache.org <us...@tika.apache.org>
> *Subject:* Re: Tika 2.1 custom ContentHandler question
>
> Can you share the code you're using?  How are you injecting your
> custom ContentHandler?  Is this occurring on pdf documents or ppt/pptx
> or something else?

Re: Tika 2.1 custom ContentHandler question

Posted by "Ensor, Neal" <En...@osti.gov>.

Well, I must apologize for taking this to the list:  it seems this is entirely my fault.

I had not properly followed the "Migrating to Tika 2.0.0" guide:
https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

Specifically, it seems I should be importing "tika-parsers-standard-package" rather than simply "tika-parsers".  After fixing that dependency my tests ran fine.  So, if anyone can learn from my mistake, hopefully that will have been at least worth it!

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
ensorn@osti.gov
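
For anyone hitting the same symptom, the corrected dependency would look roughly like the fragment below. This is a sketch based on the fix described above, not a copy of the poster's actual pom: the version is assumed to stay at 2.1.0 to match the rest of the thread. As I understand the migration guide, "tika-parsers" in 2.x is only an aggregator POM and brings in no parser classes, which would explain why AutoDetectParser emitted just the empty XHTML skeleton.

```xml
<!-- Assumed replacement for the tika-parsers (type pom) dependency -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.1.0</version>
</dependency>
```

Keeping the explicit tika-core dependency alongside this should do no harm, although it is likely pulled in transitively.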

Re: Tika 2.1 custom ContentHandler question

Posted by "Ensor, Neal" <En...@osti.gov>.

Also, and probably more importantly, NONE of the text is being captured by my handler.  The "characters" method of SAX isn't being called at all and everything is coming back blank, so something is clearly not hooked up.

Using the downloaded "app":

java -jar tika-app-2.1.0.jar --xml myfile.pdf

results in what I would expect, XML tags and all content intact.

The maven dependencies of my project:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.1.0</version>
</dependency>

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>2.1.0</version>
  <type>pom</type>
</dependency>

If that makes any difference.

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
ensorn@osti.gov
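
One quick way to narrow down a "nothing is hooked up" symptom like this is to check whether the parser classes are on the classpath at all. The sketch below uses only the JDK. The class name ParserClasspathCheck is made up, and the Tika class names passed in main are the ones I believe the PDF and OpenDocument parsers use, so treat them as assumptions and adjust as needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic: reports which named classes can be located on the classpath.
class ParserClasspathCheck {

    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns className -> present?, in the order given.
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isPresent(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed class names; the first should come from tika-core,
        // the others from the parser modules.
        Map<String, Boolean> found = check(
                "org.apache.tika.parser.AutoDetectParser",
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.tika.parser.odf.OpenDocumentParser");
        found.forEach((name, present) ->
                System.out.println((present ? "found    " : "MISSING  ") + name));
    }
}
```

If AutoDetectParser is found but the format-specific parsers are missing, the dependency list is the first thing to look at, since the tika-app jar bundles its own parsers and will behave differently from the project.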

Re: Tika 2.1 custom ContentHandler question

Posted by "Ensor, Neal" <En...@osti.gov>.

Presently, I'm trying a mix of mostly PDF and OpenOffice documents that previously parsed successfully.  The relevant code is mainly something like this:

AutoDetectParser parser = new AutoDetectParser();
Metadata md = new Metadata();
CustomContentHandler handler = new CustomContentHandler();

parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);

In the CustomContentHandler (extending DefaultHandler from SAX) I implement the startElement and endElement events, and log each such item.  Previously, I would get a number of "a", "div", and other HTML tag events.   Since 2.1.0, I only get "html", "head", a few "meta" tags, and a "title", and "body".

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
ensorn@osti.gov
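
The logging approach described above can be turned into a small check that separates "my handler is broken" from "the parser emitted only an empty skeleton". This is a hedged sketch rather than the poster's CustomContentHandler: ElementLoggingHandler, sawOnlySkeleton and the sample strings are invented for illustration, the skeleton element set is taken from the tags listed in the message, and the JDK SAX parser again stands in for Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records element names and text so the result can be inspected.
class ElementLoggingHandler extends DefaultHandler {
    // The tags reported in the message when nothing else came through.
    private static final Set<String> SKELETON =
            new HashSet<>(Arrays.asList("html", "head", "meta", "title", "body"));

    private final List<String> elements = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add((localName == null || localName.isEmpty()) ? qName : localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    List<String> getElements() { return elements; }
    String getText() { return text.toString(); }

    // True when only skeleton tags were seen and no non-whitespace text arrived.
    boolean sawOnlySkeleton() {
        return SKELETON.containsAll(elements) && text.toString().trim().isEmpty();
    }

    // Stand-in for Tika: parses a literal XHTML string with the JDK SAX parser.
    static ElementLoggingHandler run(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ElementLoggingHandler handler = new ElementLoggingHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        ElementLoggingHandler h = run(
                "<html><head><meta name=\"k\" content=\"v\"/><title/></head><body/></html>");
        System.out.println(h.getElements() + " skeleton only: " + h.sawOnlySkeleton());
    }
}
```

A skeleton-only result with a handler that otherwise works on hand-written XHTML suggests the problem lies upstream of the handler, for example in which parsers are available.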

Re: Tika 2.1 custom ContentHandler question

Posted by Tim Allison <ta...@apache.org>.

Can you share the code you're using?  How are you injecting your
custom ContentHandler?  Is this occurring on pdf documents or ppt/pptx
or something else?
