Posted to dev@lucene.apache.org by Rob McCool <ro...@ksl.Stanford.EDU> on 2004/10/21 23:05:24 UTC

question about link text indexing

Hello, I'm trying to have Lucene index a cache of HTML pages. The URL of
each page is its primary identifier. We would also like to record link
text for each page, that is, for all other pages that link to a page, record
a Field called "linktext" which contains a stream of all the words that
are used to link to that page. We can then change scoring priorities based
on whether the page contains a word, and also on whether links to that page
contain a word.
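
To make that concrete, here is roughly what I have in mind for the per-page
Documents and for a query that favors link-text matches. This is only a sketch
against the current Field/Query API; the field names are the ones above, and
the boost value is arbitrary:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LinkTextSketch {

    // One Document per cached page, keyed by its URL.
    public static Document buildPageDocument(String url, String body, String anchorText) {
        Document doc = new Document();
        doc.add(Field.Keyword("url", url));           // primary identifier, not tokenized
        doc.add(Field.Text("contents", body));        // body of the HTML page
        doc.add(Field.Text("linktext", anchorText));  // words other pages use to link here
        return doc;
    }

    // A single-word query that scores link-text matches higher than body matches.
    public static Query buildQuery(String word) {
        TermQuery inBody = new TermQuery(new Term("contents", word));
        TermQuery inLinks = new TermQuery(new Term("linktext", word));
        inLinks.setBoost(2.0f);                       // arbitrary boost, for illustration
        BooleanQuery query = new BooleanQuery();
        query.add(inBody, false, false);              // optional clause
        query.add(inLinks, false, false);             // optional clause
        return query;
    }
}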

I'm having trouble due to the model that the IndexWriter and Document 
interfaces use. My current algorithm is to create a new Document each time
we put a page into the cache, or each time we encounter link text for a page.
Any prior Documents found in the index corresponding to that URL have their
Fields copied to the new one. The old documents are then deleted, and the
new Document is added to the index.
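
In code, the per-page update I am doing now looks roughly like this (a sketch
only: the analyzer is a placeholder, error handling is omitted, and I have left
out the storeTermVector flag on "contents" for brevity):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class PerPageUpdate {

    // Replace the Document for one URL, carrying its old "linktext" forward.
    // The reader and writer are opened and closed again for every single page.
    public static void updatePage(String indexDir, String url, String body,
                                  String newAnchorText) throws Exception {
        String oldLinkText = "";

        // Pass 1: read the previous version(s) of this URL and delete them.
        IndexReader reader = IndexReader.open(indexDir);
        TermDocs termDocs = reader.termDocs(new Term("url", url));
        while (termDocs.next()) {
            Document old = reader.document(termDocs.doc());
            if (old.get("linktext") != null) {
                oldLinkText = old.get("linktext");
            }
        }
        termDocs.close();
        reader.delete(new Term("url", url));  // remove all old versions
        reader.close();

        // Pass 2: add the merged Document.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("url", url));
        doc.add(Field.Text("contents", body));
        doc.add(Field.Text("linktext", oldLinkText + " " + newAnchorText));
        writer.addDocument(doc);
        writer.close();
    }
}

Every page touched means a full close/open cycle of both the reader and the
writer, which is where the time seems to go.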

The problem I have is that this is terribly slow. Because the delete()
method is on IndexReader, I have to continually close and re-open IndexWriters
and IndexReaders to avoid locking issues. This also results in a large number
of deleted Documents. I also have to mark "contents", the Field for the body
of each HTML document, as a storedTermVector Field. This results in larger
indices. The impact on indexing speed is noticeable: I clocked it at about 200%
slower on a relatively small set of 5000 pages.

My original strategy was to perform a two-pass indexing, with the first pass
just recording the outgoing hyperlinks for each document, and the second pass
scanning all link text into memory and then copying each Document into a new 
index with all of the information together. But this means that if we do an
incremental crawl of a fast-turnover site, then the entire index needs to be
recomputed even if we only added a handful of pages.

What I'd like to do is to simply add a set of Terms to the postings for
an existing document every time I see new link text for it. I could then
use placeholders for documents I hadn't seen the contents of yet, and pull
in link text from the placeholders when writing the document's contents.
However, I can't figure out how to just add some entries to the posting table
for a particular Term with the current indexing system.

Can you give me some suggestions about how to accomplish this?

Thanks, Rob



Re: question about link text indexing

Posted by Rob McCool <ro...@ksl.Stanford.EDU>.
> The problem you have is document updating. You can split updating
> into two steps:
> 
> 1) Open an IndexWriter and add new versions of the documents you want to
> update. Close the IndexWriter.
> 2) Open an IndexReader and delete old versions of these documents. Close
> the IndexReader.

Thanks, I'll try that. I was hoping there were some private calls or some small
modification I could make to sneak extra postings for a document into the
inverted index, but I can't find any easy way to do it, so I'll go with the
two-step approach.

Thanks to Christoph and Otis for responding.



Re: question about link text indexing

Posted by Christoph Goller <go...@detego-software.de>.
Rob McCool wrote:
> Hello, I'm trying to have Lucene index a cache of HTML pages. The URL of
> each page is its primary identifier. We would also like to record link
> text for each page, that is, for all other pages that link to a page, record
> a Field called "linktext" which contains a stream of all the words that
> are used to link to that page. We can then change scoring priorities based
> on whether the page contains a word, and also on whether links to that page
> contain a word.
> 
> I'm having trouble due to the model that the IndexWriter and Document 
> interfaces use. My current algorithm is to create a new Document each time
> we put a page into the cache, or each time we encounter link text for a page.
> Any prior Documents found in the index corresponding to that URL have their
> Fields copied to the new one. The old documents are then deleted, and the
> new Document is added to the index.
> 
> The problem I have is that this is terribly slow. Because the delete()
> method is on IndexReader, I have to continually close and re-open IndexWriters
> and IndexReaders to avoid locking issues. This also results in a large number
> of deleted Documents. I also have to mark "contents", the Field for the body
> of each HTML document, as a storedTermVector Field. This results in larger
> indices. The impact on indexing speed is noticeable: I clocked it at about 200%
> slower on a relatively small set of 5000 pages.
> 
> My original strategy was to perform a two-pass indexing, with the first pass
> just recording the outgoing hyperlinks for each document, and the second pass
> scanning all link text into memory and then copying each Document into a new 
> index with all of the information together. But this means that if we do an
> incremental crawl of a fast-turnover site, then the entire index needs to be
> recomputed even if we only added a handful of pages.
> 
> What I'd like to do is to simply add a set of Terms to the postings for
> an existing document every time I see new link text for it. I could then
> use placeholders for documents I hadn't seen the contents of yet, and pull
> in link text from the placeholders when writing the document's contents.
> However, I can't figure out how to just add some entries to the posting table
> for a particular Term with the current indexing system.
> 
> Can you give me some suggestions about how to accomplish this?
> 
> Thanks, Rob

The problem you have is document updating. You can split updating
into two steps:

1) Open an IndexWriter and add new versions of the documents you want to
update. Close the IndexWriter.
2) Open an IndexReader and delete old versions of these documents. Close
the IndexReader.

The advantage of this approach is that you can do both steps for a whole batch
of documents, which makes it much more efficient. The disadvantage is that
between the two steps the index contains several versions of some documents.
You can hide this from the user by not opening a new IndexReader (Searcher)
until step 2 finishes.
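
In code, the two steps could look roughly like this (only a sketch;
deleteOldVersions is outlined below):

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class BatchUpdate {

    // updatedDocs holds the new Document versions, updatedUrls their "url" keys.
    public static void update(String indexDir, List updatedDocs, List updatedUrls)
            throws Exception {
        // Step 1: add all new versions in a single IndexWriter session.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        for (Iterator it = updatedDocs.iterator(); it.hasNext();) {
            writer.addDocument((Document) it.next());
        }
        writer.close();

        // Step 2: delete the superseded versions in a single IndexReader session.
        IndexReader reader = IndexReader.open(indexDir);
        for (Iterator it = updatedUrls.iterator(); it.hasNext();) {
            OldVersionCleaner.deleteOldVersions(reader, (String) it.next());
        }
        reader.close();
    }
}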

The question is how to delete "old" versions of a document. You can get a
TermDocs for the URL of each added document and thus identify all versions of
that document in the index. The latest version added is the one with the highest 
id. In step 2 you simply delete the documents with lower ids.
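
A sketch of deleteOldVersions, assuming the URL is indexed as an untokenized
"url" field:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class OldVersionCleaner {

    // Delete every version of a URL except the one added last, i.e. the one
    // with the highest document id.
    public static void deleteOldVersions(IndexReader reader, String url)
            throws Exception {
        // Collect the ids of all versions of this URL (returned in increasing order).
        List ids = new ArrayList();
        TermDocs termDocs = reader.termDocs(new Term("url", url));
        while (termDocs.next()) {
            ids.add(new Integer(termDocs.doc()));
        }
        termDocs.close();

        // Keep the last (highest) id, delete the rest.
        for (int i = 0; i < ids.size() - 1; i++) {
            reader.delete(((Integer) ids.get(i)).intValue());
        }
    }
}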

Christoph




Re: question about link text indexing

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Redirecting to a more appropriate lucene-user list.  Some answers are
inlined.

--- Rob McCool <ro...@ksl.Stanford.EDU> wrote:

> Hello, I'm trying to have Lucene index a cache of HTML pages. The URL of
> each page is its primary identifier. We would also like to record link
> text for each page, that is, for all other pages that link to a page, record
> a Field called "linktext" which contains a stream of all the words that
> are used to link to that page. We can then change scoring priorities based
> on whether the page contains a word, and also on whether links to that page
> contain a word.

Nutch is probably not a solution for you, but it does what you
described above and much more.  See nutch.org.

> I'm having trouble due to the model that the IndexWriter and Document
> interfaces use. My current algorithm is to create a new Document each time
> we put a page into the cache, or each time we encounter link text for a page.
> Any prior Documents found in the index corresponding to that URL have their
> Fields copied to the new one. The old documents are then deleted, and the
> new Document is added to the index.
>
> The problem I have is that this is terribly slow. Because the delete()
> method is on IndexReader, I have to continually close and re-open IndexWriters
> and IndexReaders to avoid locking issues. This also results in a large number
> of deleted Documents.

Deleting and adding is best done in batches: first delete everything
you need to delete, then add everything you need to add.

> I also have to mark "contents", the Field for the body of each HTML document,
> as a storedTermVector Field. This results in larger indices. The impact on
> indexing speed is noticeable: I clocked it at about 200% slower on a
> relatively small set of 5000 pages.

Larger indices make sense, slower indexing makes sense, too, but 200%
seems too high to me, unless your HTML pages are huge.

> My original strategy was to perform a two-pass indexing, with the first pass
> just recording the outgoing hyperlinks for each document, and the second pass
> scanning all link text into memory and then copying each Document into a new
> index with all of the information together. But this means that if we do an
> incremental crawl of a fast-turnover site, then the entire index needs to be
> recomputed even if we only added a handful of pages.
>
> What I'd like to do is to simply add a set of Terms to the postings for
> an existing document every time I see new link text for it. I could then
> use placeholders for documents I hadn't seen the contents of yet, and pull
> in link text from the placeholders when writing the document's contents.
> However, I can't figure out how to just add some entries to the posting table
> for a particular Term with the current indexing system.

To update a Field/Document you need to delete and re-add a Document,
unfortunately.

Otis

