Posted to dev@tika.apache.org by Milos Kovacevic <fo...@gmail.com> on 2008/11/13 21:04:33 UTC

Parsing incomplete PDF and Office files

Hello,

I would like to download just a few kilobytes of a PDF(doc) file and to
extract the text from it. I do not want to download the whole file and then
to parse it, just truncated first N Kbs. Is it possible with Tika or not? If
not how should I do that?

Regards, Milos
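
On the download side of this question, a minimal sketch of reading only the first N bytes of a stream in Java is shown here. The class and method names (PrefixReader, readPrefix) are made up for illustration and are not part of Tika; InputStream.readNBytes needs Java 11 or later. Over HTTP, a Range request header is one way to ask a server for only the leading bytes, if the server supports it. Whether a parser can do anything useful with such a prefix is a separate matter, discussed in the replies.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PrefixReader {

    /**
     * Reads at most maxBytes from the stream and returns them.
     * Hypothetical helper for illustration; not a Tika API.
     */
    public static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        // readNBytes blocks until maxBytes are read or the stream ends,
        // so the result is shorter only when the source itself is shorter.
        return in.readNBytes(maxBytes);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a remote document: 10000 zero bytes.
        byte[] fakeDocument = new byte[10000];
        try (InputStream in = new ByteArrayInputStream(fakeDocument)) {
            byte[] prefix = readPrefix(in, 4096);
            System.out.println("Read " + prefix.length + " bytes");
        }
    }
}
```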

Re: Parsing incomplete PDF and Office files

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Nov 14, 2008 at 8:32 AM, Milos Kovacevic <fo...@gmail.com> wrote:
> could you please give an example how to parse PDF page-by-page?

You'll want to contact pdfbox-users@incubator.apache.org for that.

I know that PDFBox is able to parse linear PDF documents (i.e. ones
that are internally stored in a page-by-page order), but AFAIK that
streaming capability is currently not used in the higher level
features like the PDFTextStripper class (even though it already does
use an event model).

BR,

Jukka Zitting

Re: Parsing incomplete PDF and Office files

Posted by Milos Kovacevic <fo...@gmail.com>.
Hello,


> That's currently not possible, but AFAIK there is support for
> page-by-page streaming in PDFBox (for PDF documents that support that,
> not all of them do). It would be nice if Tika could leverage that
> functionality in PDFBox.
>

could you please give an example how to parse PDF page-by-page?
Thanks, Milos

Re: Parsing incomplete PDF and Office files

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Nov 14, 2008 at 1:22 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> On a related note, does Tika support full text extraction of PDFs?

Yes. See http://incubator.apache.org/tika/formats.html (to be moved to
lucene.apache.org) for all the supported formats.

BR,

Jukka Zitting

Re: Parsing incomplete PDF and Office files

Posted by Jonathan Koren <jo...@soe.ucsc.edu>.
On a related note, does Tika support full text extraction of PDFs?

On Nov 13, 2008, at 1:52 PM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <fo...@gmail.com> wrote:
>> I would like to download just a few kilobytes of a PDF(doc) file and to
>> extract the text from it. I do not want to download the whole file and then
>> to parse it, just truncated first N Kbs. Is it possible with Tika or not? If
>> not how should I do that?
>
> That's currently not possible, but AFAIK there is support for
> page-by-page streaming in PDFBox (for PDF documents that support that,
> not all of them do). It would be nice if Tika could leverage that
> functionality in PDFBox.
>
> However, I'm not sure how well that would work with truncated streams.
> I guess the reasonable approach would be to stream as much text as can
> be parsed, and then fail with a TikaException if the input stream ends
> unexpectedly. Your application would then need to be aware of this
> error condition and handle it appropriately.
>
> BR,
>
> Jukka Zitting

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/



Re: Parsing incomplete PDF and Office files

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <fo...@gmail.com> wrote:
> I would like to download just a few kilobytes of a PDF(doc) file and to
> extract the text from it. I do not want to download the whole file and then
> to parse it, just truncated first N Kbs. Is it possible with Tika or not? If
> not how should I do that?

That's currently not possible, but AFAIK there is support for
page-by-page streaming in PDFBox (for PDF documents that support that,
not all of them do). It would be nice if Tika could leverage that
functionality in PDFBox.

However, I'm not sure how well that would work with truncated streams.
I guess the reasonable approach would be to stream as much text as can
be parsed, and then fail with a TikaException if the input stream ends
unexpectedly. Your application would then need to be aware of this
error condition and handle it appropriately.

BR,

Jukka Zitting
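
The approach suggested in the message above (stream out as much text as can be parsed, then fail when the input ends unexpectedly, and let the application handle the failure) can be sketched roughly as follows. This is only an illustration of the pattern, not Tika or PDFBox code: the record format is a toy one invented for the example, and TruncatedInputException is a hypothetical stand-in for TikaException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialExtraction {

    /** Hypothetical stand-in for TikaException; not the real class. */
    public static class TruncatedInputException extends Exception {
        public TruncatedInputException(String message) {
            super(message);
        }
    }

    /**
     * Toy streaming "parser". The invented format is a sequence of records,
     * each one length byte followed by that many ASCII bytes, ended by a
     * zero length byte. Text is appended to out as soon as it is decoded,
     * so the caller still has it if the stream ends early.
     */
    public static void extract(InputStream in, StringBuilder out)
            throws IOException, TruncatedInputException {
        while (true) {
            int length = in.read();
            if (length == 0) {
                return; // proper end marker reached
            }
            if (length < 0) {
                throw new TruncatedInputException("stream ended before end marker");
            }
            byte[] chunk = in.readNBytes(length);
            out.append(new String(chunk, StandardCharsets.US_ASCII));
            if (chunk.length < length) {
                throw new TruncatedInputException("stream ended inside a record");
            }
        }
    }

    /** Application side: treat truncation as expected and keep the partial text. */
    public static String extractLeniently(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        try {
            extract(in, text);
        } catch (TruncatedInputException e) {
            // Deliberately truncated input: keep what was extracted so far.
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] full = {5, 'H', 'e', 'l', 'l', 'o', 6, ' ', 'w', 'o', 'r', 'l', 'd', 0};
        byte[] truncated = Arrays.copyOf(full, 9); // cut inside the second record
        System.out.println(extractLeniently(new ByteArrayInputStream(full)));
        System.out.println(extractLeniently(new ByteArrayInputStream(truncated)));
    }
}
```

The key design choice is that the parser writes into a sink owned by the caller rather than returning a finished string, so text produced before the failure survives the exception.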