Posted to java-user@lucene.apache.org by jake dsouza <ja...@gmail.com> on 2012/11/09 09:41:00 UTC

Indexing and searching across versioned document collections

Hello,

Has anyone worked on making Lucene index and search versioned document
collections, i.e. any corpus with multiple versions of each document, such as
Wikipedia or source code?
I am working on a project to index and search versioned collections while
keeping the index as small as possible by exploiting the differences between
versions.

Could someone direct me to any existing efforts to make Lucene work with
versions?

Thanks
Jake

Re: Indexing and searching across versioned document collections

Posted by Paul Jungwirth <pj...@illuminatedcomputing.com>.
Hi Jake,

The Lucene in Action book has a case study about Krugle, a source code
searching application. I don't recall that chapter mentioning tracking
multiple versions, but I'd be surprised if they didn't have to deal with it
somehow. That might be one place to start.

Good luck!
Paul






-- 
_________________________________
Pulchritudo splendor veritatis.

Re: Indexing and searching across versioned document collections

Posted by "Johannes.Lichtenberger" <Jo...@uni-konstanz.de>.

Hello Jake,

I never found the time, but this is still on my to-do list for a versioned
XML DBS [1]. But that is also my issue: I would somehow need access to the
internal buckets, nodes, or whatever index structure Lucene uses. With a
PATRICIA trie, for instance, it is very simple in my system, as I can just
store the nodes, which are then versioned (following the CoW principle, so
that only changed nodes are written, depending on the versioning strategy
used; possibly also a bunch of nodes in a "page" that holds a set of nodes).
I never figured out how to do this with Lucene, which is why I'm thinking
about implementing or simply integrating a PATRICIA trie and enhancing an
XQuery parser with full-text capabilities.
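
To make the copy-on-write idea a bit more concrete, here is a minimal Java
sketch. It is not code from Sirix or Lucene, and every name in it
(CowTrieSketch, insert, contains, sharesSubtree) is made up for illustration.
It also uses a plain character trie rather than a compressed PATRICIA trie, and
keeps everything in memory instead of writing pages to disk. The only point it
shows is that each revision gets its own root, while an insert copies just the
nodes on the changed path and shares every other node with the previous
revision.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a copy-on-write term trie with one root per revision. */
class CowTrieSketch {

    /** Node that is never modified after construction (plain trie, not PATRICIA). */
    static final class Node {
        final Map<Character, Node> children;
        final boolean terminal; // true if a term ends at this node

        Node(Map<Character, Node> children, boolean terminal) {
            this.children = children;
            this.terminal = terminal;
        }
    }

    private static final Node EMPTY = new Node(new HashMap<>(), false);

    /** roots.get(i) is the root of revision i; revision 0 is the empty trie. */
    private final List<Node> roots = new ArrayList<>();

    CowTrieSketch() {
        roots.add(EMPTY);
    }

    int latest() {
        return roots.size() - 1;
    }

    /** Adds a term on top of the latest revision and returns the new revision number. */
    int insert(String term) {
        roots.add(insert(roots.get(latest()), term, 0));
        return latest();
    }

    /** Returns a new node for the changed path, or the same node if nothing changed. */
    private static Node insert(Node node, String term, int depth) {
        if (depth == term.length()) {
            // Children are shared, not copied: maps are never mutated once built.
            return node.terminal ? node : new Node(node.children, true);
        }
        char c = term.charAt(depth);
        Node child = node.children.get(c);
        Node newChild = insert(child == null ? EMPTY : child, term, depth + 1);
        if (newChild == child) {
            return node; // term was already present, so no node is copied
        }
        Map<Character, Node> copy = new HashMap<>(node.children);
        copy.put(c, newChild);
        return new Node(copy, node.terminal);
    }

    /** Looks up a term as of the given revision. */
    boolean contains(int revision, String term) {
        Node node = roots.get(revision);
        for (int i = 0; i < term.length() && node != null; i++) {
            node = node.children.get(term.charAt(i));
        }
        return node != null && node.terminal;
    }

    /** True if both revisions point at the very same subtree object under 'first'. */
    boolean sharesSubtree(int revA, int revB, char first) {
        Node a = roots.get(revA).children.get(first);
        Node b = roots.get(revB).children.get(first);
        return a != null && a == b;
    }

    public static void main(String[] args) {
        CowTrieSketch trie = new CowTrieSketch();
        int r1 = trie.insert("lucene");
        int r2 = trie.insert("index");
        System.out.println(trie.contains(r1, "index"));
        System.out.println(trie.contains(r2, "index"));
        System.out.println(trie.sharesSubtree(r1, r2, 'l'));
    }
}
```

Old revisions stay queryable because their roots are never touched. A real
system would additionally persist the copied nodes (or pages of nodes) and
would attach postings to the terminal nodes, which this sketch leaves out.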

However, _if_ it's possible with Lucene, it would be great :-) That said,
it's open source, and maybe someone would find some value in it and be
motivated to contribute, but that's just a wish ;-)

kind regards,
Johannes

[1] https://github.com/JohannesLichtenberger/sirix

