You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Vijay Veeraraghavan <vi...@gmail.com> on 2010/05/03 12:21:12 UTC

Using IndexReader in the web environment

Hi all,

In a clustered environment I search the index from the web
application. In the web application I am creating IndexReader on each
request. is it expensive to do like this? I read somewhere in the web
that try using the same reader as much as possible. Can i keep the
initially created IndexReader in the session/application scopes and
use the same for each request? Any other idea?

Viay

On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
> dear all,
>
> as replied below, does searching again for the document in the index
> and if found skip the indexing else index it, is this not similar to
> indexing all pdf documents once again, is not this overhead? As I am
> not going to index the details of the pdf (so if an indexed pdf was
> recreated i need not reindex it) but just the paths of the documents.
>
> Vijay
>
>>> Hey there,
>>>
>>> you might have to implement a some kind of unique identifier using an
>>> indexed lucene field. When you are indexing you should fire a query with
>>> the
>>> uuid of your document (maybe the path to you pdf document) and check if
>>> the
>>> document is in the index already. You could also do a boolean query
>>> combining UUID, timestamp and / or a hash value to see if the document
>>> has
>>> been changed. if so you can simply update the document by its UUID
>>> (something like indexwriter.updateDocument(new Term("uuid",
>>> value),document);)
>>>
>>> Unfortunately you have to implement this yourself but it should not be
>>> that
>>> much of a deal.
>>>
>>> simon
>>>
>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>> vijay.raghavan08@gmail.com> wrote:
>>>
>>>> Dear all,
>>>> I am using lucene 3.0 to index the pdf reports that I generate
>>>> dynamically. I index the pdf file name (without extension), file path
>>>> and its absolute path as fields. I search with the file name without
>>>> extension; it retrieves a list, as usually 2 or more files are present
>>>> in the same name in different sub directories. As I create the index
>>>> for the first time it updates, assuming 100 pdf files in different
>>>> directories, the files meta info. If again I do indexing, while my
>>>> report generator scheduler has the produced 500 more pdf files
>>>> totaling to 600 files in different directories, I wish to index only
>>>> the new files to the index. But presently it’s doing the whole thing
>>>> again (600 files). How to implement this functionality? Think of the
>>>> thousands of pdf files created on each run.
>>>>
>>>> P.S: I cannot keep the meta-info of generated pdf files in the java
>>>> memory, as it exceeds thousands in a single run, and update the index
>>>> looping this list.
>>>>
>>>> new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer(
>>>>                                        Version.LUCENE_CURRENT), true,
>>>>
>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>
>>>> is the boolean parameter is for this purpose? Please guide me.
>>>>
>>>> --
>>>> Thanks
>>>> Vijay Veeraraghavan
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Vijay Veeraraghavan
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards
>> Vijay Veeraraghavan
>>
>
>
> --
> Thanks & Regards
> Vijay Veeraraghavan
>


-- 
Thanks & Regards
Vijay Veeraraghavan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Using IndexReader in the web environment

Posted by Ivan Liu <ja...@gmail.com>.

You may look this:
private static IndexSearcher indexSearcher = null;

 public synchronized IndexSearcher newIndexSearcher() {
  try {

   if (null == indexSearcher) {
    Directory directory = FSDirectory.open(new
File(Config.DB_DIR+"/rssindex"));
    indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
   } else {
    IndexReader indexReader = indexSearcher.getIndexReader();
    IndexReader newIndexReader = indexReader.reopen();//reopen old
indexReader
    if (newIndexReader!=indexReader) {
     indexReader.close();
     indexSearcher.close();


     indexSearcher = new IndexSearcher(newIndexReader);
    }
   }
   return indexSearcher;
  } catch (CorruptIndexException e) {
   log.error(e.getMessage(),e);
   return null;
  } catch (IOException e) {
   log.error(e.getMessage(),e);
   return null;
  }
 }



-- 
冲浪板

my blog:http://chonglangban.appspot.com/
my site:http://kejiblog.appspot.com/

Re: Using IndexReader in the web environment

Posted by Ian Lea <ia...@gmail.com>.

You could tell the searching part of your app, via some notification
or messaging call.  Or call IndexReader.isCurrent() from time to time,
or even on every search, and reopen() if necessary.  See the javadocs
and don't forget to close the old reader when you do call reopen.


--
Ian.


On Wed, May 5, 2010 at 5:17 AM, Vijay Veeraraghavan
<vi...@gmail.com> wrote:
> hey Ian,
> thanks for the reply. I find it very useful. My report generating
> scheduler will run periodically, once done it will invoke the indexer
> and exit. In this case I do not know if the index has changed or not.
> How do i keep track of the changes in the index? As the two entities,
> scheduler/indexer and the web application, are totally different.
>
> Vijay
>
> On 5/4/10, Ian Lea <ia...@gmail.com> wrote:
>> For best performance you should aim to keep a shared index searcher,
>> or the underlying index reader, open as long as possible.  You may of
>> course need to reopen it if/when the index changes.  As to scope, you
>> can store it wherever it makes sense for your application.
>>
>>
>> --
>> Ian.
>>
>>
>> On Tue, May 4, 2010 at 10:13 AM, Vijay Veeraraghavan
>> <vi...@gmail.com> wrote:
>>> Hi,
>>> Thanks for the reply. So I will have a dedicated servlet to search the
>>> index, but does it mean that the indexsearcher does not close the
>>> index, keep it open? Is it not possible to keep it in the application
>>> scope?
>>>
>>> Vijay
>>>
>>> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> In a clustered environment I search the index from the web
>>>> application. In the web application I am creating IndexReader on each
>>>> request. is it expensive to do like this? I read somewhere in the web
>>>> that try using the same reader as much as possible. Can i keep the
>>>> initially created IndexReader in the session/application scopes and
>>>> use the same for each request? Any other idea?
>>>>
>>>> Viay
>>>>
>>>> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>>>>> dear all,
>>>>>
>>>>> as replied below, does searching again for the document in the index
>>>>> and if found skip the indexing else index it, is this not similar to
>>>>> indexing all pdf documents once again, is not this overhead? As I am
>>>>> not going to index the details of the pdf (so if an indexed pdf was
>>>>> recreated i need not reindex it) but just the paths of the documents.
>>>>>
>>>>> Vijay
>>>>>
>>>>>>> Hey there,
>>>>>>>
>>>>>>> you might have to implement a some kind of unique identifier using an
>>>>>>> indexed lucene field. When you are indexing you should fire a query
>>>>>>> with
>>>>>>> the
>>>>>>> uuid of your document (maybe the path to you pdf document) and check
>>>>>>> if
>>>>>>> the
>>>>>>> document is in the index already. You could also do a boolean query
>>>>>>> combining UUID, timestamp and / or a hash value to see if the document
>>>>>>> has
>>>>>>> been changed. if so you can simply update the document by its UUID
>>>>>>> (something like indexwriter.updateDocument(new Term("uuid",
>>>>>>> value),document);)
>>>>>>>
>>>>>>> Unfortunately you have to implement this yourself but it should not be
>>>>>>> that
>>>>>>> much of a deal.
>>>>>>>
>>>>>>> simon
>>>>>>>
>>>>>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>>>>>> vijay.raghavan08@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>> I am using lucene 3.0 to index the pdf reports that I generate
>>>>>>>> dynamically. I index the pdf file name (without extension), file path
>>>>>>>> and its absolute path as fields. I search with the file name without
>>>>>>>> extension; it retrieves a list, as usually 2 or more files are
>>>>>>>> present
>>>>>>>> in the same name in different sub directories. As I create the index
>>>>>>>> for the first time it updates, assuming 100 pdf files in different
>>>>>>>> directories, the files meta info. If again I do indexing, while my
>>>>>>>> report generator scheduler has the produced 500 more pdf files
>>>>>>>> totaling to 600 files in different directories, I wish to index only
>>>>>>>> the new files to the index. But presently it’s doing the whole thing
>>>>>>>> again (600 files). How to implement this functionality? Think of the
>>>>>>>> thousands of pdf files created on each run.
>>>>>>>>
>>>>>>>> P.S: I cannot keep the meta-info of generated pdf files in the java
>>>>>>>> memory, as it exceeds thousands in a single run, and update the index
>>>>>>>> looping this list.
>>>>>>>>
>>>>>>>> new IndexWriter(FSDirectory.open(this.indexDir), new
>>>>>>>> StandardAnalyzer(
>>>>>>>>                                        Version.LUCENE_CURRENT), true,
>>>>>>>>
>>>>>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>>>>>
>>>>>>>> is the boolean parameter is for this purpose? Please guide me.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks
>>>>>>>> Vijay Veeraraghavan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards
>>>>>>>> Vijay Veeraraghavan
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards
>>>>>> Vijay Veeraraghavan
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards
>>>>> Vijay Veeraraghavan
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Vijay Veeraraghavan
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Vijay Veeraraghavan
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> --
> Thanks & Regards
> Vijay Veeraraghavan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Using IndexReader in the web environment

Posted by Vijay Veeraraghavan <vi...@gmail.com>.

hey Ian,
thanks for the reply. I find it very useful. My report generating
scheduler will run periodically, once done it will invoke the indexer
and exit. In this case I do not know if the index has changed or not.
How do i keep track of the changes in the index? As the two entities,
scheduler/indexer and the web application, are totally different.

Vijay

On 5/4/10, Ian Lea <ia...@gmail.com> wrote:
> For best performance you should aim to keep a shared index searcher,
> or the underlying index reader, open as long as possible.  You may of
> course need to reopen it if/when the index changes.  As to scope, you
> can store it wherever it makes sense for your application.
>
>
> --
> Ian.
>
>
> On Tue, May 4, 2010 at 10:13 AM, Vijay Veeraraghavan
> <vi...@gmail.com> wrote:
>> Hi,
>> Thanks for the reply. So I will have a dedicated servlet to search the
>> index, but does it mean that the indexsearcher does not close the
>> index, keep it open? Is it not possible to keep it in the application
>> scope?
>>
>> Vijay
>>
>> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>>> Hi all,
>>>
>>> In a clustered environment I search the index from the web
>>> application. In the web application I am creating IndexReader on each
>>> request. is it expensive to do like this? I read somewhere in the web
>>> that try using the same reader as much as possible. Can i keep the
>>> initially created IndexReader in the session/application scopes and
>>> use the same for each request? Any other idea?
>>>
>>> Viay
>>>
>>> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>>>> dear all,
>>>>
>>>> as replied below, does searching again for the document in the index
>>>> and if found skip the indexing else index it, is this not similar to
>>>> indexing all pdf documents once again, is not this overhead? As I am
>>>> not going to index the details of the pdf (so if an indexed pdf was
>>>> recreated i need not reindex it) but just the paths of the documents.
>>>>
>>>> Vijay
>>>>
>>>>>> Hey there,
>>>>>>
>>>>>> you might have to implement a some kind of unique identifier using an
>>>>>> indexed lucene field. When you are indexing you should fire a query
>>>>>> with
>>>>>> the
>>>>>> uuid of your document (maybe the path to you pdf document) and check
>>>>>> if
>>>>>> the
>>>>>> document is in the index already. You could also do a boolean query
>>>>>> combining UUID, timestamp and / or a hash value to see if the document
>>>>>> has
>>>>>> been changed. if so you can simply update the document by its UUID
>>>>>> (something like indexwriter.updateDocument(new Term("uuid",
>>>>>> value),document);)
>>>>>>
>>>>>> Unfortunately you have to implement this yourself but it should not be
>>>>>> that
>>>>>> much of a deal.
>>>>>>
>>>>>> simon
>>>>>>
>>>>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>>>>> vijay.raghavan08@gmail.com> wrote:
>>>>>>
>>>>>>> Dear all,
>>>>>>> I am using lucene 3.0 to index the pdf reports that I generate
>>>>>>> dynamically. I index the pdf file name (without extension), file path
>>>>>>> and its absolute path as fields. I search with the file name without
>>>>>>> extension; it retrieves a list, as usually 2 or more files are
>>>>>>> present
>>>>>>> in the same name in different sub directories. As I create the index
>>>>>>> for the first time it updates, assuming 100 pdf files in different
>>>>>>> directories, the files meta info. If again I do indexing, while my
>>>>>>> report generator scheduler has the produced 500 more pdf files
>>>>>>> totaling to 600 files in different directories, I wish to index only
>>>>>>> the new files to the index. But presently it’s doing the whole thing
>>>>>>> again (600 files). How to implement this functionality? Think of the
>>>>>>> thousands of pdf files created on each run.
>>>>>>>
>>>>>>> P.S: I cannot keep the meta-info of generated pdf files in the java
>>>>>>> memory, as it exceeds thousands in a single run, and update the index
>>>>>>> looping this list.
>>>>>>>
>>>>>>> new IndexWriter(FSDirectory.open(this.indexDir), new
>>>>>>> StandardAnalyzer(
>>>>>>>                                        Version.LUCENE_CURRENT), true,
>>>>>>>
>>>>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>>>>
>>>>>>> is the boolean parameter is for this purpose? Please guide me.
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Vijay Veeraraghavan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards
>>>>>>> Vijay Veeraraghavan
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards
>>>>> Vijay Veeraraghavan
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Vijay Veeraraghavan
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Vijay Veeraraghavan
>>>
>>
>>
>> --
>> Thanks & Regards
>> Vijay Veeraraghavan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Thanks & Regards
Vijay Veeraraghavan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Using IndexReader in the web environment

Posted by Ian Lea <ia...@gmail.com>.

For best performance you should aim to keep a shared index searcher,
or the underlying index reader, open as long as possible.  You may of
course need to reopen it if/when the index changes.  As to scope, you
can store it wherever it makes sense for your application.


--
Ian.


On Tue, May 4, 2010 at 10:13 AM, Vijay Veeraraghavan
<vi...@gmail.com> wrote:
> Hi,
> Thanks for the reply. So I will have a dedicated servlet to search the
> index, but does it mean that the indexsearcher does not close the
> index, keep it open? Is it not possible to keep it in the application
> scope?
>
> Vijay
>
> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>> Hi all,
>>
>> In a clustered environment I search the index from the web
>> application. In the web application I am creating IndexReader on each
>> request. is it expensive to do like this? I read somewhere in the web
>> that try using the same reader as much as possible. Can i keep the
>> initially created IndexReader in the session/application scopes and
>> use the same for each request? Any other idea?
>>
>> Viay
>>
>> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>>> dear all,
>>>
>>> as replied below, does searching again for the document in the index
>>> and if found skip the indexing else index it, is this not similar to
>>> indexing all pdf documents once again, is not this overhead? As I am
>>> not going to index the details of the pdf (so if an indexed pdf was
>>> recreated i need not reindex it) but just the paths of the documents.
>>>
>>> Vijay
>>>
>>>>> Hey there,
>>>>>
>>>>> you might have to implement a some kind of unique identifier using an
>>>>> indexed lucene field. When you are indexing you should fire a query
>>>>> with
>>>>> the
>>>>> uuid of your document (maybe the path to you pdf document) and check if
>>>>> the
>>>>> document is in the index already. You could also do a boolean query
>>>>> combining UUID, timestamp and / or a hash value to see if the document
>>>>> has
>>>>> been changed. if so you can simply update the document by its UUID
>>>>> (something like indexwriter.updateDocument(new Term("uuid",
>>>>> value),document);)
>>>>>
>>>>> Unfortunately you have to implement this yourself but it should not be
>>>>> that
>>>>> much of a deal.
>>>>>
>>>>> simon
>>>>>
>>>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>>>> vijay.raghavan08@gmail.com> wrote:
>>>>>
>>>>>> Dear all,
>>>>>> I am using lucene 3.0 to index the pdf reports that I generate
>>>>>> dynamically. I index the pdf file name (without extension), file path
>>>>>> and its absolute path as fields. I search with the file name without
>>>>>> extension; it retrieves a list, as usually 2 or more files are present
>>>>>> in the same name in different sub directories. As I create the index
>>>>>> for the first time it updates, assuming 100 pdf files in different
>>>>>> directories, the files meta info. If again I do indexing, while my
>>>>>> report generator scheduler has the produced 500 more pdf files
>>>>>> totaling to 600 files in different directories, I wish to index only
>>>>>> the new files to the index. But presently it’s doing the whole thing
>>>>>> again (600 files). How to implement this functionality? Think of the
>>>>>> thousands of pdf files created on each run.
>>>>>>
>>>>>> P.S: I cannot keep the meta-info of generated pdf files in the java
>>>>>> memory, as it exceeds thousands in a single run, and update the index
>>>>>> looping this list.
>>>>>>
>>>>>> new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer(
>>>>>>                                        Version.LUCENE_CURRENT), true,
>>>>>>
>>>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>>>
>>>>>> is the boolean parameter is for this purpose? Please guide me.
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Vijay Veeraraghavan
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards
>>>>>> Vijay Veeraraghavan
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Vijay Veeraraghavan
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Vijay Veeraraghavan
>>>
>>
>>
>> --
>> Thanks & Regards
>> Vijay Veeraraghavan
>>
>
>
> --
> Thanks & Regards
> Vijay Veeraraghavan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Using IndexReader in the web environment

Posted by Vijay Veeraraghavan <vi...@gmail.com>.

Hi,
Thanks for the reply. So I will have a dedicated servlet to search the
index, but does it mean that the indexsearcher does not close the
index, keep it open? Is it not possible to keep it in the application
scope?

Vijay

On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
> Hi all,
>
> In a clustered environment I search the index from the web
> application. In the web application I am creating IndexReader on each
> request. is it expensive to do like this? I read somewhere in the web
> that try using the same reader as much as possible. Can i keep the
> initially created IndexReader in the session/application scopes and
> use the same for each request? Any other idea?
>
> Viay
>
> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
>> dear all,
>>
>> as replied below, does searching again for the document in the index
>> and if found skip the indexing else index it, is this not similar to
>> indexing all pdf documents once again, is not this overhead? As I am
>> not going to index the details of the pdf (so if an indexed pdf was
>> recreated i need not reindex it) but just the paths of the documents.
>>
>> Vijay
>>
>>>> Hey there,
>>>>
>>>> you might have to implement a some kind of unique identifier using an
>>>> indexed lucene field. When you are indexing you should fire a query
>>>> with
>>>> the
>>>> uuid of your document (maybe the path to you pdf document) and check if
>>>> the
>>>> document is in the index already. You could also do a boolean query
>>>> combining UUID, timestamp and / or a hash value to see if the document
>>>> has
>>>> been changed. if so you can simply update the document by its UUID
>>>> (something like indexwriter.updateDocument(new Term("uuid",
>>>> value),document);)
>>>>
>>>> Unfortunately you have to implement this yourself but it should not be
>>>> that
>>>> much of a deal.
>>>>
>>>> simon
>>>>
>>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>>> vijay.raghavan08@gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>> I am using lucene 3.0 to index the pdf reports that I generate
>>>>> dynamically. I index the pdf file name (without extension), file path
>>>>> and its absolute path as fields. I search with the file name without
>>>>> extension; it retrieves a list, as usually 2 or more files are present
>>>>> in the same name in different sub directories. As I create the index
>>>>> for the first time it updates, assuming 100 pdf files in different
>>>>> directories, the files meta info. If again I do indexing, while my
>>>>> report generator scheduler has the produced 500 more pdf files
>>>>> totaling to 600 files in different directories, I wish to index only
>>>>> the new files to the index. But presently it’s doing the whole thing
>>>>> again (600 files). How to implement this functionality? Think of the
>>>>> thousands of pdf files created on each run.
>>>>>
>>>>> P.S: I cannot keep the meta-info of generated pdf files in the java
>>>>> memory, as it exceeds thousands in a single run, and update the index
>>>>> looping this list.
>>>>>
>>>>> new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer(
>>>>>                                        Version.LUCENE_CURRENT), true,
>>>>>
>>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>>
>>>>> is the boolean parameter is for this purpose? Please guide me.
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Vijay Veeraraghavan
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards
>>>>> Vijay Veeraraghavan
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Vijay Veeraraghavan
>>>
>>
>>
>> --
>> Thanks & Regards
>> Vijay Veeraraghavan
>>
>
>
> --
> Thanks & Regards
> Vijay Veeraraghavan
>


-- 
Thanks & Regards
Vijay Veeraraghavan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Using IndexReader in the web environment

Posted by Erick Erickson <er...@gmail.com>.

The quick answer is that the session is probably the wrong place to keep
an IndexReader, since that's per-user. I'd define  a new server/servlet that
did my searching and have my webapps use that. Makes it really simple
to re-use index readers.

And reopening the IndexReader for each request will probably cause you
to have very unsatisfactory search performance under load.

FWIW
Erick

Also:
>From Hossman's Apache page:

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



On Mon, May 3, 2010 at 6:21 AM, Vijay Veeraraghavan <
vijay.raghavan08@gmail.com> wrote:

> Hi all,
>
> In a clustered environment I search the index from the web
> application. In the web application I am creating IndexReader on each
> request. is it expensive to do like this? I read somewhere in the web
> that try using the same reader as much as possible. Can i keep the
> initially created IndexReader in the session/application scopes and
> use the same for each request? Any other idea?
>
> Viay
>
> On 5/3/10, Vijay Veeraraghavan <vi...@gmail.com> wrote:
> > dear all,
> >
> > as replied below, does searching again for the document in the index
> > and if found skip the indexing else index it, is this not similar to
> > indexing all pdf documents once again, is not this overhead? As I am
> > not going to index the details of the pdf (so if an indexed pdf was
> > recreated i need not reindex it) but just the paths of the documents.
> >
> > Vijay
> >
> >>> Hey there,
> >>>
> >>> you might have to implement a some kind of unique identifier using an
> >>> indexed lucene field. When you are indexing you should fire a query
> with
> >>> the
> >>> uuid of your document (maybe the path to you pdf document) and check if
> >>> the
> >>> document is in the index already. You could also do a boolean query
> >>> combining UUID, timestamp and / or a hash value to see if the document
> >>> has
> >>> been changed. if so you can simply update the document by its UUID
> >>> (something like indexwriter.updateDocument(new Term("uuid",
> >>> value),document);)
> >>>
> >>> Unfortunately you have to implement this yourself but it should not be
> >>> that
> >>> much of a deal.
> >>>
> >>> simon
> >>>
> >>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
> >>> vijay.raghavan08@gmail.com> wrote:
> >>>
> >>>> Dear all,
> >>>> I am using lucene 3.0 to index the pdf reports that I generate
> >>>> dynamically. I index the pdf file name (without extension), file path
> >>>> and its absolute path as fields. I search with the file name without
> >>>> extension; it retrieves a list, as usually 2 or more files are present
> >>>> in the same name in different sub directories. As I create the index
> >>>> for the first time it updates, assuming 100 pdf files in different
> >>>> directories, the files meta info. If again I do indexing, while my
> >>>> report generator scheduler has the produced 500 more pdf files
> >>>> totaling to 600 files in different directories, I wish to index only
> >>>> the new files to the index. But presently it’s doing the whole thing
> >>>> again (600 files). How to implement this functionality? Think of the
> >>>> thousands of pdf files created on each run.
> >>>>
> >>>> P.S: I cannot keep the meta-info of generated pdf files in the java
> >>>> memory, as it exceeds thousands in a single run, and update the index
> >>>> looping this list.
> >>>>
> >>>> new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer(
> >>>>                                        Version.LUCENE_CURRENT), true,
> >>>>
> >>>> IndexWriter.MaxFieldLength.LIMITED);
> >>>>
> >>>> is the boolean parameter is for this purpose? Please guide me.
> >>>>
> >>>> --
> >>>> Thanks
> >>>> Vijay Veeraraghavan
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thanks & Regards
> >>>> Vijay Veeraraghavan
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >> Thanks & Regards
> >> Vijay Veeraraghavan
> >>
> >
> >
> > --
> > Thanks & Regards
> > Vijay Veeraraghavan
> >
>
>
> --
> Thanks & Regards
> Vijay Veeraraghavan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>