Posted to user@tika.apache.org by nskarthik <ns...@gmail.com> on 2021/10/21 14:42:10 UTC

Tika 2.1.0 pdf parser

Hi

Spec: JDK 15.0, tika-core-2.1.0.jar, Windows 10

Process: Page-level text extraction from simple, non-password-protected PDFs using Java

Question: I need to extract text and images at page level using Java. I did not find any example on the web or the Tika website.

Request: Please share a Java code snippet or a URL containing one.


with regards
Karthik

Re: Tika 2.1.0 pdf parser

Posted by nskarthik <ns...@gmail.com>.

Hi

OK, so you are saying POI is currently only a text extractor for doc/docx.

I will do some homework and get back on this.

This thread can be closed.

Thanks, the help is appreciated.

On 2021/10/22 20:57:00, Tim Allison <ta...@apache.org> wrote: 
> The other complication is how to handle embedded files. Perhaps punt on
> them to start?
> 

Re: Tika 2.1.0 pdf parser

Posted by Tim Allison <ta...@apache.org>.

The other complication is how to handle embedded files. Perhaps punt on
them to start?

On Fri, Oct 22, 2021 at 4:43 PM Tim Allison <ta...@apache.org> wrote:

> Hi Karthik,
>
>   Tika hasn't been set up well to extract images and text per page.
> As Nick pointed out, we do mark page breaks in the xhtml, and we do
> put links for image locations within the text for file types that
> support that.
>
>    Part of the challenge is that not all document types are paged
> (doc/docx), but also images get tricky quickly
> (https://issues.apache.org/jira/browse/TIKA-3416).
>
>   There was a request to do something like this here:
> https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
> been getting more requests to do this.  We might want to improve our
> /unpack endpoint or create a new one.  I don't think I'll be able to
> work on this for a bit.  Let us know what you find and how you solve
> this.
>
>
>       Best,
>
>             Tim
>

Re: Tika 2.1.0 pdf parser

Posted by Tim Allison <ta...@apache.org>.

Hi Karthik,

  Tika hasn't been set up well to extract images and text per page.
As Nick pointed out, we do mark page breaks in the xhtml, and we do
put links for image locations within the text for file types that
support that.

   Part of the challenge is that not all document types are paged
(doc/docx), but also images get tricky quickly
(https://issues.apache.org/jira/browse/TIKA-3416).

  There was a request to do something like this here:
https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
been getting more requests to do this.  We might want to improve our
/unpack endpoint or create a new one.  I don't think I'll be able to
work on this for a bit.  Let us know what you find and how you solve
this.

      Best,

            Tim

On Fri, Oct 22, 2021 at 11:15 AM nskarthik <ns...@gmail.com> wrote:
>
> Hi
>
> I plan to get Text/images out of pdf/docx/xlsx./html/csv/mht......so on
>
> Instead of using POI / PDFBox /... thought Tika would be single source of Data extraction...
>
> Hence wanted to use the same.
>
>
> with regards
> Karthik
>

Re: Tika 2.1.0 pdf parser

Posted by nskarthik <ns...@gmail.com>.

Hi

I plan to get text and images out of pdf/docx/xlsx/html/csv/mht and so on.

Instead of using POI, PDFBox, etc. separately, I thought Tika would be a single source for data extraction.

Hence I wanted to use it.


with regards
Karthik

On 2021/10/22 14:41:38, AJ Weber <aw...@comcast.net> wrote: 
> 
> >>> Question :  Need to extract Text / images at page level using java.
> >>> Did not find any example on www or Tika website.
> 
> Why not use a library specifically suited to the job like Apache PDFBox (directly)?
>   
> 
>

Re: Tika 2.1.0 pdf parser

Posted by AJ Weber <aw...@comcast.net>.

>>> Question :  Need to extract Text / images at page level using java.
>>> Did not find any example on www or Tika website.

Why not use a library specifically suited to the job like Apache PDFBox (directly)?

Re: Tika 2.1.0 pdf parser

Posted by nskarthik <ns...@gmail.com>.

Hi

Thanks for the suggestion.

Is there a simple example of this?

Please share.


with regards
Karthik

On 2021/10/21 18:26:58, Nick Burch <ap...@gagravarr.org> wrote: 
> On Thu, 21 Oct 2021, nskarthik wrote:
> > Question :  Need to extract Text / images at page level using java. 
> > Did not find any example on www or Tika website.
> 
> For PDF, you should fetch the contents as XHTML rather than plain text. 
> You can then split on the page divs. This isn't available for formats 
> which aren't page-based, but luckily PDF is
> 
> Depending on what you want to do, it might make sense to write a custom 
> ContentHandler which works a lot like the ToTextContentHandler in Tika, 
> but which starts writing to a new text buffer each time it hits the event 
> for a new page
> 
> Nick
>

Re: Tika 2.1.0 pdf parser

Posted by Nick Burch <ap...@gagravarr.org>.

On Thu, 21 Oct 2021, nskarthik wrote:
> Question :  Need to extract Text / images at page level using java. 
> Did not find any example on www or Tika website.

For PDF, you should fetch the contents as XHTML rather than plain text.
You can then split on the page divs. This isn't available for formats
which aren't page-based, but luckily PDF is.
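
A minimal sketch of the "split on the page divs" idea, assuming the XHTML
has already been obtained from Tika (for example via a ToXMLContentHandler)
and that each PDF page is wrapped in <div class="page">. The class and
method names (PageDivSplitter, splitPages) are hypothetical, and the sample
XHTML is hand-written for illustration, not real Tika output. Only JDK
classes are used:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits Tika-style XHTML into one text string per page.
class PageDivSplitter {

    /** Returns the text of each <div class="page"> element, in document order. */
    static List<String> splitPages(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot trigger XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            List<String> pages = new ArrayList<>();
            NodeList divs = doc.getElementsByTagName("div");
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                // Assumption: page boundaries are marked with class="page".
                if ("page".equals(div.getAttribute("class"))) {
                    pages.add(div.getTextContent().trim());
                }
            }
            return pages;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the XHTML a PDF parse would produce.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";
        List<String> pages = splitPages(xhtml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

This loads the whole document into memory, which is simple but may not
suit very large PDFs.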

Depending on what you want to do, it might make sense to write a custom
ContentHandler which works a lot like the ToTextContentHandler in Tika,
but which starts writing to a new text buffer each time it hits the event
for a new page.
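
A rough sketch of such a handler, assuming the "event for a new page" is
the start of a <div class="page"> element. The class name
PerPageTextHandler and the helper parseXhtml are hypothetical. To keep the
example runnable with only the JDK, it is driven by the JDK SAX parser
over hand-written XHTML. With Tika on the classpath, the same handler
could instead be passed to
Parser.parse(InputStream, ContentHandler, Metadata, ParseContext):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text into a fresh buffer for every page div.
class PerPageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside any page div
    private int depth;             // nesting depth of elements inside the page div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // some element nested inside the current page
        } else if (("div".equals(localName) || "div".equals(qName))
                && "page".equals(atts.getValue("class"))) {
            // Assumption: a new page starts at <div class="page">.
            current = new StringBuilder();
            depth = 0;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // not inside a page
        }
        if (depth == 0) {
            pages.add(current.toString().trim()); // closing the page div itself
            current = null;
        } else {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return Collections.unmodifiableList(pages);
    }

    /** Demo driver using the JDK SAX parser instead of a Tika parser. */
    static PerPageTextHandler parseXhtml(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            PerPageTextHandler handler = new PerPageTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";
        List<String> pages = parseXhtml(xhtml).getPages();
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

Because the handler works on streaming events, it never needs the full
XHTML in memory. Text outside any page div is ignored.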

Nick