Posted to java-user@lucene.apache.org by hi...@wipro.com on 2005/04/06 09:07:51 UTC

How to index and search PDF documents.

Hello sir,
Thank you for replying.
But my query is not about updating indexes, optimizing indexes, or similar topics.

Sorry sir, maybe my earlier question was not clear enough. Here I describe my problem in more detail.
I have to develop a search engine for my website. We have some data on a local common drive which is accessible to everyone. The drive contains directories which in turn contain HTML pages, PDF files, PowerPoint presentations, XML files, Word documents, XLS files etc. The requirement is that people in my department want a website with a search engine so that they can search for any file on this drive.
So I created a website and I am trying to build a search engine for it.

I have used Lucene to create the search engine. As you must know, Lucene first indexes all the files and then we can search them. Up to here everything works fine. But when I integrate Lucene with my web server, i.e. Apache Tomcat, it indexes only HTML documents and text documents. My requirement is to index all files on the common drive, i.e. .pdf, .xml, .ppt, .xls etc. To solve this problem I referred to the Lucene FAQ (http://wiki.apache.org/jakarta-lucene/LuceneFAQ). There I searched for "How to index pdf files", PowerPoint presentations etc. Question 34 is the solution for PDF files. Please refer to it; maybe you will understand my problem better.

It says that a parser is required to extract the text from PDF files and convert it into a text file. After this extraction, Lucene indexes the resulting text document. Now when I search in my browser, the file is found but it is displayed as a text document, which is not what is required. I want it to be displayed as a PDF document. So please give me a solution and tell me how this problem can be solved.

"THAT IS, HOW CAN I INDEX AND SEARCH .pdf, .ppt, .xml, .doc etc. DOCUMENTS WITH LUCENE?"
I WILL BE REALLY THANKFUL IF YOU SOLVE MY PROBLEM.
 
 
Regards,
Himani Tandon
Project Engineer
Wipro Technologies(Gurgaon)
Email: himani.tandon@wipro.com

________________________________

From: Chuck Williams [mailto:chuck@allthingslocal.com]
Sent: Wed 4/6/2005 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: wildcarded phrase queries



Erik Hatcher writes (4/5/2005 5:57 PM):

> I have a need to implement wildcarded phrase queries, such as this:
>
>     "apach? luc*"
>
> which would match "apache lucene", for example.  This needs to also
> support ordered and unordered proximity like SpanNearQuery does:
>
>     "apach? luc*"~10
>
> I presume I'm going to have to key off of SpanQuery with some
> specialized subclasses.
>
> What approach do you recommend for implementing something like this?

Hi Erik,

Might it be as easy as creating a SpanWildcardQuery that transforms into
a SpanOrQuery of SpanTermQuery's, and then use a SpanNearQuery of
SpanWildcardQuery's?  You could use a WildcardTermEnum to generate the
list of terms for the SpanOrQuery.  This would have some issues like
computing the idf as the sum of all the pattern-matched terms, but it
looks like that issue still exists with WildcardQuery too.  I haven't
done much with SpanQuery's so this might not work out so simply, or be
acceptably efficient.
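
The expansion step described above can be sketched with the Java standard library alone. This is not Lucene code: the class name WildcardExpansion, its methods, and the sample term list are hypothetical, and the plain list stands in for the term dictionary that a WildcardTermEnum would walk in a real index. Each wildcard term is turned into a regular expression and matched against the dictionary; the resulting terms are what a SpanOrQuery would combine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class WildcardExpansion {

    // '?' matches exactly one character, '*' matches any run of characters
    // (including none); every other character is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the dictionary terms matching the wildcard, in dictionary order.
    static List<String> expand(String wildcard, List<String> dictionary) {
        Pattern pattern = toRegex(wildcard);
        List<String> matches = new ArrayList<>();
        for (String term : dictionary) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample dictionary; a real one would come from the index.
        List<String> terms = Arrays.asList(
                "apache", "apachy", "apply", "lucene", "lucid", "solr");
        System.out.println(expand("apach?", terms));
        System.out.println(expand("luc*", terms));
    }
}
```

Each expanded list would become one SpanOrQuery clause, and the clauses would then be combined by position, which is the part this sketch leaves out.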

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: How to index and search PDF documents.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 6, 2005, at 3:07 AM, <hi...@wipro.com> wrote:
> "THAT IS, HOW CAN I INDEX AND SEARCH .pdf, .ppt, .xml, .doc etc.
> DOCUMENTS WITH LUCENE?"
> I WILL BE REALLY THANKFUL IF YOU SOLVE MY PROBLEM.

<commercial>

Get a copy of Lucene in Action.  Otis wrote a great chapter on how to 
handle various document formats with Lucene:

	http://www.lucenebook.com/search?query=index+pdf

</commercial>

It sounds like you're using the Lucene demo application.  That is a 
reasonable starting point, but you will very quickly want to diverge 
from it and integrate in a PDF reader (PDFBox being the best open 
source library) and something to read Office documents (TextMining for 
Word docs, perhaps POI for the other formats).
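
One way to organize that integration is a small registry that maps file extensions to text extractors. The following is only a sketch under stated assumptions: ExtractorRegistry, register, extract, and extensionOf are hypothetical names, and the lambdas in main are placeholders where a real application would call PDFBox, TextMining, or POI. None of those libraries' APIs are shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical registry; the extractor functions are supplied by the caller.
final class ExtractorRegistry {

    // lower-cased extension -> function turning raw file bytes into text
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the lower-cased extension, or "" when the name has no dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Empty result means no extractor is registered for this file type.
    Optional<String> extract(String fileName, byte[] content) {
        Function<byte[], String> extractor = extractors.get(extensionOf(fileName));
        return extractor == null
                ? Optional.empty()
                : Optional.of(extractor.apply(content));
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Placeholder: a real extractor would delegate to a PDF library here.
        registry.register("pdf", bytes -> "(text a PDF library would return)");

        byte[] sample = "plain text body".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("readme.TXT", sample));
        System.out.println(registry.extract("slides.ppt", sample));
    }
}
```

Files with no registered extractor come back empty, so the indexer can skip or log them instead of failing on an unknown format.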

Also, have a look at Nutch - it may actually be the indexing/searching 
solution you want for your intranet rather than a custom Lucene 
application.

	Erik




RE: How to index and search PDF documents.

Posted by Chandrashekhar <ch...@cybage.com>.
Hi Himani,
Your search results should carry a reference to the original documents
(for example, a content or document id). Then add rendering logic that
serves the original content when the user clicks a result link.
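
To make that concrete: the extracted text is used only for matching, while a stored reference (here, the path of the original file) is what the results page links to. The sketch below uses a tiny in-memory inverted index as a stand-in for Lucene. PathIndex, add, and search are hypothetical names, not Lucene API, and the file paths are made up. In Lucene the same idea is expressed by keeping the path as a stored field next to the indexed contents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a real index; not Lucene API.
final class PathIndex {

    // term -> paths of the ORIGINAL files whose extracted text contains it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the extracted text, but remember only the original file's path.
    void add(String originalPath, String extractedText) {
        for (String token : extractedText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(originalPath);
            }
        }
    }

    // Returns the original paths matching a single term, sorted by path.
    List<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        PathIndex index = new PathIndex();
        // Assumed sample data: the text would come from a PDF or Word parser.
        index.add("/share/docs/report.pdf", "Annual report for 2005");
        index.add("/share/docs/notes.txt", "Meeting notes, annual review");

        // The web page would render each hit as a link to the original file.
        for (String path : index.search("annual")) {
            System.out.println(path);
        }
    }
}
```

Because the hit is the path of the PDF rather than the extracted text file, the browser opens the document in its original format.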

Regards,
Chandra

-----Original Message-----
From: himani.tandon@wipro.com [mailto:himani.tandon@wipro.com]
Sent: Wednesday, April 06, 2005 12:38 PM
To: java-user@lucene.apache.org
Subject: How to index and search PDF documents.





