Posted to java-user@lucene.apache.org by Ben Litchfield <be...@csh.rit.edu> on 2002/07/09 16:47:34 UTC

PDF Text Stripper

Hi,

I have written a PDF library that can be used to strip text from PDF
documents.  It is released under LGPL so have fun.

There is one class which can be used to easily index PDF documents.
pdfparser.searchengine.lucene.LucenePDFDocument has a getDocument
method which will take a PDF file and return a Lucene Document which you
can add to an index.

If you would like to see the quality of the text extraction you can run
pdfparser.Main from the command line which will take a PDF document and
write a txt file.

I am looking for any input that you might have.  Please mail me if you
have any bugs or feature requests.

The library can be retrieved from
http://www.csh.rit.edu/~ben/projects/pdfparser/

-Ben Litchfield


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

RE: PDF Text Stripper

Posted by Kelvin Tan <ke...@relevanz.com>.

I just took a look at JPedal and I'm very impressed. It extracted some
text as XML data with no problem.

Amazingly, it also creates thumbnails of the PDF file, which is something
I've needed but couldn't find...:)

Regards,
Kelvin


On Wed, 10 Jul 2002 09:59:32 +0200, Jose Galiana wrote:
>Hi,
>
>I've used JPedal ( www.jpedal.org ). It's distributed under the LGPL
>license and extracts raw text, among other uses.
>
>I wrote code to extract text using the Etymon PJ library. With PDFs that
>use proprietary fonts, I needed to create a cross table to translate
>Unicode to ASCII, because Distiller inserts only a subset of the Unicode
>table for each proprietary font.
>
>JPedal has no problem with those fonts and extracts all text as XML,
>suitable for use with Lucene.

RE: PDF Text Stripper

Posted by Jose Galiana <jg...@renr.es>.

Hi,

I've used JPedal ( www.jpedal.org ). It's distributed under the LGPL
license and extracts raw text, among other uses.

I wrote code to extract text using the Etymon PJ library. With PDFs that
use proprietary fonts, I needed to create a cross table to translate
Unicode to ASCII, because Distiller inserts only a subset of the Unicode
table for each proprietary font.

JPedal has no problem with those fonts and extracts all text as XML,
suitable for use with Lucene.
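
The cross table mentioned above could look something like the following.
This is only a minimal sketch, not the code that was actually written: the
class name FontCrossTable, its methods, and the private-use code points in
the usage note are all hypothetical, and it uses no Etymon PJ or JPedal
calls. It assumes each proprietary font needs its own table and that
characters missing from the table should pass through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-font lookup table: font-specific code -> ASCII character.
final class FontCrossTable {
    private final Map<Character, Character> table = new HashMap<>();

    // Register one mapping; returns this so calls can be chained.
    FontCrossTable map(char fontCode, char ascii) {
        table.put(fontCode, ascii);
        return this;
    }

    // Translate extracted text; characters without a mapping are kept as-is.
    String translate(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character mapped = table.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

With made-up mappings such as map('\uF048', 'H') and map('\uF069', 'i'),
translate("\uF048\uF069!") returns "Hi!". The real tables would have to be
built by inspecting each font.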


Re: PDF Text Stripper

Posted by Ben Litchfield <be...@csh.rit.edu>.

Can you send me the PDF document that you are having problems with? I
will look into it.

There are still some issues that I am working out with the spacing of
characters.
-Ben



On Tue, 9 Jul 2002, Keith Gunn wrote:

> hi,
>
> I downloaded the zip and quickly ran the demo on a few files. The output
> shows .notdef between words, and there are spaces between every letter
> within words. Is there code in your dist. to remove these so that just
> the terms remain?
>
> Keith Gunn
> University Of Aberdeen

Re: PDF Text Stripper

Posted by Keith Gunn <kg...@csd.abdn.ac.uk>.

hi,

I downloaded the zip and quickly ran the demo on a few files. The output
shows .notdef between words, and there are spaces between every letter
within words. Is there code in your dist. to remove these so that just
the terms remain?

Keith Gunn
University Of Aberdeen
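
Until the spacing is fixed in the library itself, a post-processing pass
along these lines might serve as a workaround. This is a minimal sketch
and not part of pdfparser: the class ExtractedTextCleaner and its clean
method are hypothetical names. It assumes that .notdef appears exactly
where a word boundary should be and that all other whitespace is spurious
spacing between letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical cleanup step for extracted text; not part of pdfparser.
final class ExtractedTextCleaner {
    private static final String NOTDEF = ".notdef";

    static String clean(String raw) {
        List<String> words = new ArrayList<>();
        // Assumption: each ".notdef" marks a boundary between two words.
        for (String chunk : raw.split(Pattern.quote(NOTDEF))) {
            // Assumption: whitespace inside a chunk is spurious letter spacing.
            String word = chunk.replaceAll("\\s+", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        // Rejoin the recovered words with single spaces.
        return String.join(" ", words);
    }
}
```

For the input "T h i s .notdef i s .notdef t e x t", clean returns
"This is text". If a file also contains genuine spaces between words,
this approach would merge those words, so it only fits output that looks
like the case described above.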


