Posted to java-user@lucene.apache.org by ashwin kumar <gv...@gmail.com> on 2007/03/08 10:37:28 UTC

indexing pdfs

Hi, can someone help me by sharing sample programs for indexing PDF and
.doc files?

thanks
regards
ashwin

Re: indexing pdfs

Posted by Ulf Dittmer <ul...@ulfdittmer.com>.

For DOC files you can use the Jakarta POI library. Text extraction is  
outlined here: http://jakarta.apache.org/poi/hwpf/quick-guide.html

Ulf

On 08.03.2007, at 10:37, ashwin kumar wrote:

> hi can some one help me by giving any sample programs for indexing  
> pdfs and .doc files

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: WildcardQuery and memory

Posted by Erick Erickson <er...@gmail.com>.

You can also use a filter. The basic idea is to construct a Lucene
Filter, probably using something like RegexTermEnum/TermDocs.
It's faster than you think <G>. This, in combination with
ConstantScoreQuery, should fix you right up. Several things:

1> You lose scoring for the filter part of the query with this technique.
If you search on, say, erick AND h*s, your query would be scored
entirely by the 'erick' clause. But you'd still match documents
with 'erick' and any of the terms 'hoss', 'hiss', 'his', 'hers', ...

2> The size is capped at (total docs / 8) bytes, since a filter is really
a bitmap with one bit position for each doc in your index.

3> Filters can be cached; see CachingWrapperFilter.
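
A minimal sketch of the bitmap idea in points 1> and 2>, in plain Java
with no Lucene classes at all. The "index" is a hypothetical map from
term to document ids, and the class and method names are made up for
illustration. It only shows why the memory cost depends on the number of
documents rather than on the number of matching terms:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Toy model of the filter idea. Nothing here is Lucene API: the "index" is a
// plain map from term to the ids of the documents that contain it.
final class WildcardBitmapSketch {

    // Translate '*' (any run of characters) and '?' (one character) to a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Walk the term dictionary once; set one bit per document of each matching term.
    static BitSet wildcardBits(Map<String, int[]> postings, String wildcard, int maxDoc) {
        Pattern pattern = toRegex(wildcard);
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                for (int doc : entry.getValue()) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }

    // Made-up postings for a six-document index (ids 0..5).
    static Map<String, int[]> sampleIndex() {
        Map<String, int[]> postings = new TreeMap<String, int[]>();
        postings.put("erick", new int[] {0, 2, 4});
        postings.put("hers", new int[] {5});
        postings.put("his", new int[] {2});
        postings.put("hiss", new int[] {3});
        postings.put("horse", new int[] {4});
        postings.put("hoss", new int[] {0, 1});
        return postings;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = sampleIndex();
        int maxDoc = 6;
        BitSet hits = wildcardBits(postings, "erick", maxDoc); // exact term
        hits.and(wildcardBits(postings, "h*s", maxDoc));        // wildcard part, used as a filter
        System.out.println(hits); // prints {0, 2}
    }
}
```

However many terms match h*s, the result is still one bit per document.
For the 500,000-document index mentioned in this thread, that is
500,000 / 8 = 62,500 bytes per bitmap.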

Hope this helps
Erick

On 3/9/07, Rob Staveley (Tom) <rs...@seseit.com> wrote:
>
> For indexing e-mail, I recommend that you tokenise the e-mail addresses
> into fragments and query on the fragments as whole terms rather than
> using wildcards.
>
> Rather than looking for fischauto333* in (say) smtp-from, look for
> fischauto333 in (say) an additional field called smtp-from-fragments to
> match the term fischauto333 (it also contains the terms yahoo.de and yahoo
> and de).
>
> Having whole e-mail addresses in terms and using prefix/wildcard queries
> inevitably results in too many clauses.

Re: WildcardQuery and memory

Posted by Joe <fi...@yahoo.de>.

Hi Rob,
> For indexing e-mail, I recommend that you tokenise the e-mail addresses into
> fragments and query on the fragments as whole terms rather than using
> wildcards. 
>
> [example]
>   
Hm, for e-mail addresses this isn't a big problem here.
The real problem is the query on the body part of an e-mail, which is
added to the index in this way:
doc.add(new Field(DocumentFields.BODY, filter.filter(fields.body),
        Field.Store.NO, Field.Index.TOKENIZED));

RE: WildcardQuery and memory

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

For indexing e-mail, I recommend that you tokenise the e-mail addresses into
fragments and query on the fragments as whole terms rather than using
wildcards. 

Rather than looking for fischauto333* in (say) smtp-from, look for
fischauto333 in (say) an additional field called smtp-from-fragments to
match the term fischauto333 (it also contains the terms yahoo.de and yahoo
and de).

Having whole e-mail addresses in terms and using prefix/wildcard queries
inevitably results in too many clauses.
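
A minimal sketch of the fragment idea, in plain Java. The class and
method names are hypothetical, and the splitting rules (local part,
whole domain, then each dot-separated label) are an assumption based on
the fischauto333 / yahoo.de / yahoo / de example above. The resulting
fragments would be indexed as whole terms in the extra field:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: break an e-mail address into whole-term fragments
// so that exact term queries can replace prefix/wildcard queries.
final class EmailFragments {

    static Set<String> fragments(String address) {
        Set<String> out = new LinkedHashSet<String>();
        String lower = address.toLowerCase(Locale.ROOT);
        int at = lower.indexOf('@');
        if (at < 0) {
            // Not a full address: keep it as a single fragment.
            out.add(lower);
            return out;
        }
        String local = lower.substring(0, at);
        String domain = lower.substring(at + 1);
        if (!local.isEmpty()) {
            out.add(local);                 // e.g. fischauto333
        }
        if (!domain.isEmpty()) {
            out.add(domain);                // e.g. yahoo.de
            for (String label : domain.split("\\.")) {
                if (!label.isEmpty()) {
                    out.add(label);         // e.g. yahoo, de
                }
            }
        }
        return out;
    }
}
```

With this, EmailFragments.fragments("fischauto333@yahoo.de") yields
fischauto333, yahoo.de, yahoo and de, so a search for fischauto333 is a
single term lookup instead of an expanded wildcard query.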


-----Original Message-----
From: Joe [mailto:fischauto333@yahoo.de] 
Sent: 09 March 2007 12:08
To: java-user@lucene.apache.org
Subject: WildcardQuery and memory

Hi,

Here we use Lucene to index our e-mails, currently 500,000 documents.
The problems arise when searching the body with a WildcardQuery.

I did some profiling with JProfiler. I see that the more BooleanClause
instances are used, the more memory is required during search.
Most memory is used by instances of the classes SegmentTermDocs and
CompoundFileReader.CSIndexInput:
nearly 30 MB per 1000 BooleanClause instances.

So when I limit the BooleanClauses with BooleanQuery.setMaxClauseCount(5000)
and more and more e-mails get indexed, will this lead to an OutOfMemoryError,
or will the ratio of 30 MB per 1000 clauses still hold?

What should I do to prevent OOM without reducing
BooleanQuery.setMaxClauseCount() too much?

WildcardQuery and memory

Posted by Joe <fi...@yahoo.de>.

Hi,

Here we use Lucene to index our e-mails, currently 500,000 documents.
The problems arise when searching the body with a WildcardQuery.

I did some profiling with JProfiler. I see that the more BooleanClause
instances are used, the more memory is required during search.
Most memory is used by instances of the classes SegmentTermDocs and
CompoundFileReader.CSIndexInput:
nearly 30 MB per 1000 BooleanClause instances.

So when I limit the BooleanClauses with BooleanQuery.setMaxClauseCount(5000)
and more and more e-mails get indexed, will this lead to an OutOfMemoryError,
or will the ratio of 30 MB per 1000 clauses still hold?

What should I do to prevent OOM without reducing
BooleanQuery.setMaxClauseCount() too much?

RE: indexing pdfs

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.

Hi Ashwin,

Well, in that case you might need to use IFilters some other way instead
of through SeekAFile.  I don't know how, since I haven't used them myself.
Perhaps someone else here has.

Sachin

-----Original Message-----
From: ashwin kumar [mailto:gv.ashwin@gmail.com] 
Sent: 09 March 2007 02:48
To: java-user@lucene.apache.org
Subject: Re: indexing pdfs

Hi Sachin, the link you gave me offers only a zip file and an exe file for
download, and the zip file contains no class files. Wouldn't we
need a jar file or class files?

Re: indexing pdfs

Posted by ashwin kumar <gv...@gmail.com>.

Hi Sachin, the link you gave me offers only a zip file and an exe file for
download, and the zip file contains no class files. Wouldn't we
need a jar file or class files?

On 3/8/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Hi,
>
> Here it is:
>
> http://www.seekafile.org/

RE: indexing pdfs

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.

Hi,

Here it is:

http://www.seekafile.org/ 

-----Original Message-----
From: ashwin kumar [mailto:gv.ashwin@gmail.com] 
Sent: 08 March 2007 13:07
To: java-user@lucene.apache.org
Subject: Re: indexing pdfs

Hi again,
do we have to download any jar files to run this program? If so, can you
give me the link please?

ashwin

Re: indexing pdfs

Posted by ashwin kumar <gv...@gmail.com>.

Hi again,
do we have to download any jar files to run this program? If so, can you
give me the link please?

ashwin

On 3/8/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Well you don't need to actually save the text to disk and then index the
> saved index file, you can directly index that text in-memory.
>
> The only other way I have heard of is to use Ifilters.  I believe
> SeekAFile does indexing of pdfs.
>
> Sachin

RE: indexing pdfs

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.

Well, you don't need to actually save the text to disk and then index the
saved text file; you can index that text directly in memory.

The only other way I have heard of is to use IFilters.  I believe
SeekAFile indexes PDFs.

Sachin 

-----Original Message-----
From: ashwin kumar [mailto:gv.ashwin@gmail.com] 
Sent: 08 March 2007 11:35
To: java-user@lucene.apache.org
Subject: Re: indexing pdfs

Is the only way to index PDFs to convert them to text and then
index that text?

Re: indexing pdfs

Posted by ashwin kumar <gv...@gmail.com>.

Is the only way to index PDFs to convert them to text and then index
that text?



On 3/8/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Hi Aswin,
>
> You can try pdfbox to convert the pdf documents to text and then use
> Lucene to index the text.  The code for turning a pdf to text is very
> simple:
>
> private static string parseUsingPDFBox(string filename)
>         {
>             // document reader
>             PDDocument doc = PDDocument.load(filename);
>             // create stripper (wish I had the power to do that -
> wouldn't leave the house)
>             PDFTextStripper stripper = new PDFTextStripper();
>             // get text from doc using stripper
>             return stripper.getText(doc);
>         }
>
> Sachin

RE: indexing pdfs

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.

Hi Ashwin,

You can try PDFBox to convert the PDF documents to text and then use
Lucene to index the text.  The code for turning a PDF into text is very
simple:

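// Java version, assuming the PDFBox 0.7.x package names (org.pdfbox);
// later releases moved to org.apache.pdfbox.
import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

private static String parseUsingPDFBox(String filename) throws IOException
        {
            // document reader
            PDDocument doc = PDDocument.load(filename);
            try {
                // create stripper (wish I had the power to do that - wouldn't leave the house)
                PDFTextStripper stripper = new PDFTextStripper();
                // get text from doc using stripper
                return stripper.getText(doc);
            } finally {
                // release the resources held by the loaded document
                doc.close();
            }
        }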

Sachin

-----Original Message-----
From: ashwin kumar [mailto:gv.ashwin@gmail.com] 
Sent: 08 March 2007 09:37
To: java-user@lucene.apache.org
Subject: indexing pdfs

Hi, can someone help me by sharing sample programs for indexing PDF
and .doc files?

thanks
regards
ashwin


This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)


This email and any attached files are confidential and copyright protected. If you are not the addressee, any dissemination of this communication is strictly prohibited. Unless otherwise expressly agreed in writing, nothing stated in this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins plc.  Registered in England No. 1885586.  Registered Office Woodcote Grove, Ashley Road, Epsom, Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really need to. 
