Posted to java-user@lucene.apache.org by jake dsouza <ja...@gmail.com> on 2012/11/09 09:41:00 UTC
Indexing and searching across versioned document collections
Hello,
Has anyone worked on making Lucene index and search versioned document
collections, i.e., any corpus with multiple versions of each document, similar to
Wikipedia or source code?
I am working on a project to index and search versioned collections while
keeping the index as small as possible by exploiting the differences between
versions.
Could someone direct me to any existing efforts to make Lucene work with
versions?
Thanks
Jake
Re: Indexing and searching across versioned document collections
Posted by Paul Jungwirth <pj...@illuminatedcomputing.com>.
Hi Jake,
The Lucene in Action book has a case study about Krugle, a source code
searching application. I don't recall that chapter mentioning tracking
multiple versions, but I'd be surprised if they didn't have to deal with it
somehow. That might be one place to start.
Good luck!
Paul
--
_________________________________
Pulchritudo splendor veritatis.
Re: Indexing and searching across versioned document collections
Posted by "Johannes.Lichtenberger" <Jo...@uni-konstanz.de>.
Hello Jake,
I never found the time, but this is still on my to-do list for a versioned
XML DBS [1]. But that is also my issue: I would somehow need access to the internal
buckets, nodes, or whatever index structure Lucene uses. With
a PATRICIA trie, for instance, it's very simple in my system, as I can just store the
nodes, which are then versioned (following the CoW principle, so that only changed
nodes are written, depending on the versioning strategy used; possibly also
a bunch of nodes in a "page" that holds a set of nodes). I never
figured out how to do this with Lucene, which is why I'm thinking about
implementing or simply integrating a PATRICIA trie and enhancing an XQuery
parser with full-text capabilities.
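To illustrate the CoW principle described above, here is a minimal sketch in
plain Java. It is neither Sirix code nor a Lucene API, and it uses a simple
character trie rather than a real PATRICIA trie. The names CowTrie, commit,
find and contains are made up for this illustration. Each commit creates a new
root and copies only the nodes along the path of the inserted term; all other
subtrees are shared with the previous revision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical copy-on-write character trie: every revision has its own root. */
final class CowTrie {

    /** Trie node that is never modified after construction. */
    static final class Node {
        final Map<Character, Node> children;
        final boolean terminal; // true if a term ends at this node

        Node(Map<Character, Node> children, boolean terminal) {
            this.children = children;
            this.terminal = terminal;
        }
    }

    private static final Node EMPTY = new Node(new TreeMap<>(), false);

    // roots.get(r) is the root of revision r; revision 0 is the empty trie.
    private final List<Node> roots = new ArrayList<>();

    CowTrie() {
        roots.add(EMPTY);
    }

    /** Adds a term on top of the latest revision; returns the new revision number. */
    int commit(String term) {
        Node latest = roots.get(roots.size() - 1);
        roots.add(insert(latest, term, 0));
        return roots.size() - 1;
    }

    /** Path copying: only the nodes along the path of the term are re-created. */
    private static Node insert(Node node, String term, int depth) {
        if (depth == term.length()) {
            return node.terminal ? node : new Node(node.children, true);
        }
        char c = term.charAt(depth);
        Node child = node.children.getOrDefault(c, EMPTY);
        Node newChild = insert(child, term, depth + 1);
        if (newChild == child) {
            return node; // term already present, nothing to copy
        }
        Map<Character, Node> copy = new TreeMap<>(node.children);
        copy.put(c, newChild); // sibling subtrees are shared by reference
        return new Node(copy, node.terminal);
    }

    /** Returns the node reached by following prefix in the given revision, or null. */
    Node find(int revision, String prefix) {
        Node node = roots.get(revision);
        for (int i = 0; i < prefix.length() && node != null; i++) {
            node = node.children.get(prefix.charAt(i));
        }
        return node;
    }

    /** True if the term was present as of the given revision. */
    boolean contains(int revision, String term) {
        Node node = find(revision, term);
        return node != null && node.terminal;
    }

    public static void main(String[] args) {
        CowTrie trie = new CowTrie();
        int r1 = trie.commit("lucene");
        int r2 = trie.commit("lucid");
        System.out.println(trie.contains(r1, "lucid"));  // false: r1 predates "lucid"
        System.out.println(trie.contains(r2, "lucene")); // true: r2 still sees "lucene"
        // The untouched subtree below "luce" is the same object in both revisions.
        System.out.println(trie.find(r1, "luce") == trie.find(r2, "luce")); // true
    }
}
```

Whether Lucene's own index structures can be versioned this way is exactly the
open question in this thread; the sketch only shows why the approach is
straightforward when the index nodes themselves are under your control.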
However, _if_ it's possible with Lucene, that would be great :-) That said,
it's open source, so maybe someone will find it valuable and be motivated
to contribute, but that's just a wish ;-)
kind regards,
Johannes
[1] https://github.com/JohannesLichtenberger/sirix