You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by sonfon <so...@gmail.com> on 2009/03/07 06:21:30 UTC

Search while indexing

Dear All,
    Now, I'm considering to build index for my application with lucene. However, as the document sources I'm going to index has many duplications, so before adding a document to an IndexWriter, I hope search in the index database first to see if a same document copy has already been added. I used IndexSearcher to search the same Dir while IndexWriter writing to it. However, it seem that IndexSearcher returned no result though I'm sure there are duplicate copies indexed already. And after the indexing procedure, I can get the search results, so I'm sure I didn't write the wrong code. Anyone could offer some help? Some example codes are appreciated.
    Best
Wishes.

Re: Search while indexing

Posted by Erick Erickson <er...@gmail.com>.
Lucene is an *indexing* engine, it isn't, and shouldn't be, in the
business of domain-specific knowledge, in this case web crawling.

You might look at Nutch, which is a web-crawling and indexing
application that uses Lucene as its underlying search engine, and
there are other web crawlers out there that take care of that part
of the process.

Best of luck!
Erick

On Sun, Mar 8, 2009 at 12:35 AM, sonfon <so...@gmail.com> wrote:

> Erick:
>    Thanks for your suggestion. I think another solution would be keeping an
> list of keywords that could uniquely identify a document in a database, and
> search for keywords before adding a new document. As querying database is
> fast, this probaly wouldn't cost much time. But this would request
> maintaining a database while indexing. I just wondered if lucene offers an
> interface identifying duplicates. I think identifying duplicate URLs when
> indexing web would be common.
>    Best
> Wishes.
> ----- Original Message -----
> From: "Erick Erickson" <er...@gmail.com>
> To: <ja...@lucene.apache.org>
> Sent: Saturday, March 07, 2009 10:58 PM
> Subject: Re: Search while indexing
>
>
> > First, you'll probably want to search the user list archive for this
> issue,
> > as
> > it's been discussed and you'll find more information than I can remember
> > off the top of my head. That said:
> >
> > 1> changes to an index are not visible until you reopen the reader. You
> >     probably have to flush the writer in the meantime. And this will
> >    be costly to do for every document.
> >
> > 2> How do you identify duplicates? If it's a short enough signature,
> >      you could consider keeping an in-memory list and check that
> >      while indexing. If you needed to update your index you could
> >      simply use TermEnum/TermDocs to read all the values into
> >      memory before adding to it.
> >
> > 3> You could consider using some kind of calculated signature of
> >     the whole file for your key, but that may not suit your app.
> >
> > Best
> > Erick
> >
> >
> >
> > On Sat, Mar 7, 2009 at 12:21 AM, sonfon <so...@gmail.com> wrote:
> >
> >> Dear All,
> >>    Now, I'm considering to build index for my application with lucene.
> >> However, as the document sources I'm going to index has many
> duplications,
> >> so before adding a document to an IndexWriter, I hope search in the
> index
> >> database first to see if a same document copy has already been added. I
> used
> >> IndexSearcher to search the same Dir while IndexWriter writing to it.
> >> However, it seem that IndexSearcher returned no result though I'm sure
> there
> >> are duplicate copies indexed already. And after the indexing procedure,
> I
> >> can get the search results, so I'm sure I didn't write the wrong code.
> >> Anyone could offer some help? Some example codes are appreciated.
> >>    Best
> >> Wishes.
> >
>

Re: Search while indexing

Posted by sonfon <so...@gmail.com>.
Erick:
    Thanks for your suggestion. I think another solution would be keeping an list of keywords that could uniquely identify a document in a database, and search for keywords before adding a new document. As querying database is fast, this probaly wouldn't cost much time. But this would request maintaining a database while indexing. I just wondered if lucene offers an interface identifying duplicates. I think identifying duplicate URLs when indexing web would be common. 
    Best
Wishes.
----- Original Message ----- 
From: "Erick Erickson" <er...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Saturday, March 07, 2009 10:58 PM
Subject: Re: Search while indexing


> First, you'll probably want to search the user list archive for this issue,
> as
> it's been discussed and you'll find more information than I can remember
> off the top of my head. That said:
> 
> 1> changes to an index are not visible until you reopen the reader. You
>     probably have to flush the writer in the meantime. And this will
>    be costly to do for every document.
> 
> 2> How do you identify duplicates? If it's a short enough signature,
>      you could consider keeping an in-memory list and check that
>      while indexing. If you needed to update your index you could
>      simply use TermEnum/TermDocs to read all the values into
>      memory before adding to it.
> 
> 3> You could consider using some kind of calculated signature of
>     the whole file for your key, but that may not suit your app.
> 
> Best
> Erick
> 
> 
> 
> On Sat, Mar 7, 2009 at 12:21 AM, sonfon <so...@gmail.com> wrote:
> 
>> Dear All,
>>    Now, I'm considering to build index for my application with lucene.
>> However, as the document sources I'm going to index has many duplications,
>> so before adding a document to an IndexWriter, I hope search in the index
>> database first to see if a same document copy has already been added. I used
>> IndexSearcher to search the same Dir while IndexWriter writing to it.
>> However, it seem that IndexSearcher returned no result though I'm sure there
>> are duplicate copies indexed already. And after the indexing procedure, I
>> can get the search results, so I'm sure I didn't write the wrong code.
>> Anyone could offer some help? Some example codes are appreciated.
>>    Best
>> Wishes.
>

Re: Search while indexing

Posted by Erick Erickson <er...@gmail.com>.
First, you'll probably want to search the user list archive for this issue,
as
it's been discussed and you'll find more information than I can remember
off the top of my head. That said:

1> changes to an index are not visible until you reopen the reader. You
     probably have to flush the writer in the meantime. And this will
    be costly to do for every document.

2> How do you identify duplicates? If it's a short enough signature,
      you could consider keeping an in-memory list and check that
      while indexing. If you needed to update your index you could
      simply use TermEnum/TermDocs to read all the values into
      memory before adding to it.

3> You could consider using some kind of calculated signature of
     the whole file for your key, but that may not suit your app.

Best
Erick



On Sat, Mar 7, 2009 at 12:21 AM, sonfon <so...@gmail.com> wrote:

> Dear All,
>    Now, I'm considering to build index for my application with lucene.
> However, as the document sources I'm going to index has many duplications,
> so before adding a document to an IndexWriter, I hope search in the index
> database first to see if a same document copy has already been added. I used
> IndexSearcher to search the same Dir while IndexWriter writing to it.
> However, it seem that IndexSearcher returned no result though I'm sure there
> are duplicate copies indexed already. And after the indexing procedure, I
> can get the search results, so I'm sure I didn't write the wrong code.
> Anyone could offer some help? Some example codes are appreciated.
>    Best
> Wishes.