Posted to java-user@lucene.apache.org by Kevin Burton <bu...@newsmonster.org> on 2004/05/18 20:43:39 UTC

Internal full content store within Lucene

Per the discussion the other day about storing content external to 
Lucene, I think we have an opportunity to improve the Lucene core and 
bring a lot of functionality to future developers.

Right now Lucene allows you to have a 'stored' field which keeps the 
content with a segment along with your inverted index.

While this is flexible for small indexes, in production environments it 
falls down because index merges take FOREVER.

A thread the other day suggested storing just a pointer to a file on 
the filesystem.  This got me thinking about a long-term mechanism I 
wanted for our cluster, where we store content outside of the index in 
a high-performance flat-file database.

The Lucene index would only maintain FILENO:OFFSET:LENGTH info within 
the index, and this would allow us to point to our flat-file database. 
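
To make the pointer idea concrete, here is a minimal sketch in Java of 
what such a store could look like.  Nothing here is Lucene API: the 
class name FlatContentStore, the "content-N.dat" file naming and the 
string form of the pointer are all assumptions made up for illustration. 
It appends content to a numbered flat file and hands back a 
"FILENO:OFFSET:LENGTH" string that can be resolved later.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical flat-file content store; illustration only, not part of Lucene. */
final class FlatContentStore {
    private final Path dir;

    FlatContentStore(Path dir) {
        this.dir = dir;
    }

    // Assumed naming scheme: one data file per file number.
    private Path fileFor(int fileNo) {
        return dir.resolve("content-" + fileNo + ".dat");
    }

    /** Appends content to file fileNo and returns a "FILENO:OFFSET:LENGTH" pointer. */
    String append(int fileNo, String content) throws IOException {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile(fileFor(fileNo).toFile(), "rw")) {
            long offset = raf.length();   // new content always goes at the end
            raf.seek(offset);
            raf.write(bytes);
            return fileNo + ":" + offset + ":" + bytes.length;
        }
    }

    /** Resolves a pointer produced by append() back into the stored content. */
    String read(String pointer) throws IOException {
        String[] parts = pointer.split(":");
        int fileNo = Integer.parseInt(parts[0]);
        long offset = Long.parseLong(parts[1]);
        int length = Integer.parseInt(parts[2]);
        byte[] buf = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(fileFor(fileNo).toFile(), "r")) {
            raf.seek(offset);
            raf.readFully(buf);           // fails with EOFException if the pointer is stale
        }
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```

In this sketch only the short pointer string would need to live in the 
index (for example as a small stored field), so a merge would copy a 
few bytes per document instead of the full content.  LENGTH is counted 
in bytes, not characters, and concurrent writers are not handled.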

This would allow Lucene index merges to be FAST, support native field 
storage, and allow the filesystem to optimize contiguous blocks for the 
flat content store.  Everyone wins.

This is what the Internet archive uses:

http://www.archive.org/web/researcher/ArcFileFormat.php

I propose that Lucene support a new form of stored field that allows 
an external storage engine to keep the content in a flat text store.

How much interest is there in this?  I have to do this for work and 
will certainly put in the extra effort to make this a standard Lucene 
feature. 

I can come up with a requirements doc and a more formal proposal in 
another email if I get enough +1s...

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Internal full content store within Lucene

Posted by Terry Steichen <te...@net-frame.com>.
+1



Re: Internal full content store within Lucene

Posted by Michael Giles <mg...@visionstudio.com>.
Certainly any advancement in this area seems like a good idea.

I'll throw a use case on the pile as well.  For my own interest, the 
biggest need is in highlighting (i.e. highlighting relevant segments within 
the full text of documents).  I need to provide highlighted abstracts in 
the search results, so the solution would need to be performant enough to 
provide that service.

-Mike

At 02:43 PM 5/18/2004, you wrote:
>Per the discussion the other day about storing content external to Lucene 
>I think we have an opportunity to improve the lucene core and bring a lot 
>of functionality to future developers.



Re: Internal full content store within Lucene

Posted by Kevin Burton <bu...@newsmonster.org>.
Morus Walter wrote:

>Kevin Burton writes:
>
>>How much interest is there for this?  I have to do this for work and 
>>will certainly take the extra effort into making this a standard Lucene 
>>feature. 
>
>Sounds interesting.
>How would you handle deletions?
>
Deletions aren't a requirement in our scenario... It would probably be 
more efficient to just leave the content on disk.

If you want to GC over time, the arc files can be grouped together by 
time so you can eventually just delete a whole arc file...
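
As a rough sketch of that kind of time-based GC, assuming one content 
file per day named "content-YYYY-MM-DD.arc" (the naming scheme, the 
class ArcFileGc and the method deleteOlderThan are hypothetical, not 
anything in Lucene or the ARC format), expiring old content could be as 
simple as deleting every file dated before a cutoff:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;

/** Hypothetical time-based GC over per-day content files; illustration only. */
final class ArcFileGc {
    private static final String PREFIX = "content-";
    private static final String SUFFIX = ".arc";

    /**
     * Deletes every "content-YYYY-MM-DD.arc" file in dir dated strictly
     * before cutoff and returns how many files were removed.
     */
    static int deleteOlderThan(Path dir, LocalDate cutoff) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, PREFIX + "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                String datePart = name.substring(PREFIX.length(), name.length() - SUFFIX.length());
                // LocalDate.parse throws DateTimeParseException on a malformed name.
                if (LocalDate.parse(datePart).isBefore(cutoff)) {
                    Files.delete(file);
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

Any index entries still pointing into a deleted file would have to be 
removed or tolerated by the reader; the sketch does not address that.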

Kevin



Re: Internal full content store within Lucene

Posted by Morus Walter <mo...@tanto.de>.
Kevin Burton writes:
> 
> How much interest is there for this?  I have to do this for work and 
> will certainly take the extra effort into making this a standard Lucene 
> feature. 
> 
Sounds interesting.
How would you handle deletions?

Morus
