Posted to java-user@lucene.apache.org by Kevin Burton <bu...@newsmonster.org> on 2004/05/18 20:43:39 UTC
Internal full content store within Lucene
Per the discussion the other day about storing content external to
Lucene, I think we have an opportunity to improve the Lucene core and
bring a lot of functionality to future developers.
Right now Lucene allows you to have a 'stored' field which keeps the
content with a segment along with your inverted index.
While this is flexible for small indexes, in production environments it
falls down because index merges take FOREVER.
A thread the other day suggested storing just a pointer to a file on
the filesystem. This got me thinking about a long-term
mechanism I wanted for our cluster where we store content outside of the
index in a high performance flat-file database.
The Lucene index would only maintain FILENO:OFFSET:LENGTH info, and
this would allow us to point to our flat file database.
This would allow Lucene index merges to be FAST, support native field
storage, and allow the filesystem to optimize contiguous blocks for the
flat content store. Everyone wins.
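To make the pointer idea concrete, here is a minimal sketch in Java of appending content to a flat file and resolving a FILENO:OFFSET:LENGTH pointer back to the bytes. Everything in it is an assumption for illustration: the FlatContentStore class, the store-N.dat file naming, and the colon-separated string encoding are hypothetical and are not Lucene APIs.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical flat-file content store; none of these names are Lucene APIs. */
class FlatContentStore {
    private final Path dir;

    FlatContentStore(Path dir) {
        this.dir = dir;
    }

    /** Data file for a given file number, e.g. store-0.dat (naming is an assumption). */
    private Path fileFor(int fileNo) {
        return dir.resolve("store-" + fileNo + ".dat");
    }

    /** Appends content to the given file and returns a FILENO:OFFSET:LENGTH pointer. */
    String append(int fileNo, String content) throws IOException {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile(fileFor(fileNo).toFile(), "rw")) {
            long offset = raf.length();   // new content always goes at the end
            raf.seek(offset);
            raf.write(bytes);
            return fileNo + ":" + offset + ":" + bytes.length;
        }
    }

    /** Resolves a FILENO:OFFSET:LENGTH pointer back to the original content. */
    String read(String pointer) throws IOException {
        String[] parts = pointer.split(":");
        int fileNo = Integer.parseInt(parts[0]);
        long offset = Long.parseLong(parts[1]);
        int length = Integer.parseInt(parts[2]);
        byte[] bytes = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(fileFor(fileNo).toFile(), "r")) {
            raf.seek(offset);
            raf.readFully(bytes);         // one seek plus one contiguous read
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        FlatContentStore store = new FlatContentStore(Files.createTempDirectory("flatstore"));
        String first = store.append(0, "hello lucene");
        String second = store.append(0, "second document");
        System.out.println(first + " -> " + store.read(first));
        System.out.println(second + " -> " + store.read(second));
    }
}
```

In a setup like this, only the short pointer string would need to live in the index as a stored field, so a merge copies a few bytes per document instead of the full content.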
This is the approach the Internet Archive uses with its ARC file format:
http://www.archive.org/web/researcher/ArcFileFormat.php
I propose that Lucene support a new form of stored field that allows an
external storage engine to keep the content in a flat text store.
How much interest is there in this? I have to do this for work and
will certainly put in the extra effort to make this a standard Lucene
feature.
I can come up with a requirements doc and a more formal proposal in
another email if I get enough +1s...
Kevin
--
Please reply using PGP.
http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Internal full content store within Lucene
Posted by Terry Steichen <te...@net-frame.com>.
+1
Re: Internal full content store within Lucene
Posted by Michael Giles <mg...@visionstudio.com>.
Certainly any advancement in this area seems like a good idea.
I'll throw a use case on the pile as well. For my own interest, the
biggest need is in highlighting (i.e. highlighting relevant segments within
the full text of documents). I need to provide highlighted abstracts in
the search results, so the solution would need to be performant enough to
provide that service.
-Mike
At 02:43 PM 5/18/2004, you wrote:
>Per the discussion the other day about storing content external to Lucene
>I think we have an opportunity to improve the lucene core and bring a lot
>of functionality to future developers.
Re: Internal full content store within Lucene
Posted by Kevin Burton <bu...@newsmonster.org>.
Morus Walter wrote:
>Kevin Burton writes:
>>How much interest is there for this? I have to do this for work and
>>will certainly take the extra effort into making this a standard Lucene
>>feature.
>Sounds interesting.
>How would you handle deletions?
Deletions aren't a requirement in our scenario... It would probably be
more efficient to just leave the content on disk.
If you want to GC over time, the arc files can be grouped together by
time so you can eventually delete a whole arc file...
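A rough sketch of that time-grouped cleanup in Java, under stated assumptions: each arc file is named after the day it covers (yyyy-MM-dd.arc), and the ArcFileGc class and deleteOlderThan method are hypothetical names made up for illustration. A file whose name does not parse as a date would make LocalDate.parse throw, which a real implementation would need to handle.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical time-bucketed cleanup; the yyyy-MM-dd.arc naming is an assumption. */
class ArcFileGc {
    /** Deletes every *.arc file in dir whose date-named bucket is older than cutoff. */
    static List<String> deleteOlderThan(Path dir, LocalDate cutoff) throws IOException {
        List<String> deleted = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.arc")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Strip the ".arc" suffix to recover the bucket date.
                LocalDate bucket = LocalDate.parse(name.substring(0, name.length() - 4));
                if (bucket.isBefore(cutoff)) {
                    Files.delete(file);   // drops the whole bucket in one operation
                    deleted.add(name);
                }
            }
        }
        return deleted;
    }
}
```

Any index entries still pointing into a deleted file would dangle, so the matching documents would have to be removed from the index as well.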
Kevin
Re: Internal full content store within Lucene
Posted by Morus Walter <mo...@tanto.de>.
Kevin Burton writes:
>
> How much interest is there for this? I have to do this for work and
> will certainly take the extra effort into making this a standard Lucene
> feature.
>
Sounds interesting.
How would you handle deletions?
Morus