You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2010/11/08 20:03:59 UTC

Re: Document links

Any updates/progress with this?

I'm looking at ways to implement an RTree with lucene -- and this
discussion seems relevant

thanks
ryan


On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <ma...@yahoo.co.uk> wrote:
>>>Both these on disk data structures and the ones in a B+ tree have seek offsets
>>>into files
>>>that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>>But do these disk data structures support dynamic addition and deletion of
>>>(larger
>>>numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> ----- Original Message ----
> From: Paul Elschot <pa...@xs4all.nl>
> To: dev@lucene.apache.org
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>>traversals have exponential numbers of links and so all these index disk seeks
>>start to stack up. The solution I propose uses doc ids as more-or-less direct
>>pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and will
>>compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, but
> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of
>>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>>milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m
>>edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and the
>>difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>>other forms of link analysis e.g. recommendation engines require similar
>>performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>>around
>>
>> >>>> "layered segments" for dealing with field updates
>> >>
>> >> Right. Fast discovery of document relations is a foundation on which lots of
>>
>> >> things like this can build. Relations can be given types to support a number
>>of
>>
>> >> different use cases.
>> >
>> > How about using this (bsd licenced) tree as a starting point:
>> > http://bplusdotnet.sourceforge.net/
>> > It has various keys: ao. byte array, String and long.
>> >
>> > A fixed size byte array as key seems to be just fine: two bytes for a field
>>number,
>> > four for the segment number and four for the in-segment document id.
>> > The separate segment number would allow to minimize the updates
>> > in the tree during merges. One could also use the normal doc id directly.
>> >
>> > The value could then be a similar to the key, but without
>> > the field number, and with an indication of the direction of the link.
>> > Or perhaps the direction of the link should be added to the key.
>> > A link would be present twice, once for each direction.
>> > Also both directions could have their own payloads.
>> >
>> > It could be put in its own file as a separate 'segment', or maybe
>> > each segment could allow for allocation of a part of this tree.
>> >
>> > I like this somehow, in case it is done right one might never
>> > need a relational database again. Well, almost...
>> >
>> > Regards,
>> > Paul Elschot
>> >
>> >
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> From: Grant Ingersoll <gs...@apache.org>
>> >> To: dev@lucene.apache.org
>> >> Sent: Fri, 24 September, 2010 16:26:27
>> >> Subject: Re: Document links
>> >>
>> >> While not exactly equivalent, it reminds me of our earlier discussion around
>>
>> >> "layered segments" for dealing with field updates [1], [2], albeit this is a
>>bit
>>
>> >> more generic since one could not only use the links for relating documents,
>>but
>>
>> >> one could use "special" links underneath the covers in Lucene to
>>maintain/mark
>>
>> >> which fields have been updated and then traverse to them.
>> >>
>> >> [1]
>> >>
>>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>> >>
>> >> [2]
>> >>
>>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>> >>
>> >>
>> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> >>
>> >>> This slideshow has a first-cut on the Lucene file format extensions
>>required to
>>
>> >>>
>> >>> support fast linking between documents:
>> >>>
>> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>> >>>
>> >>>
>> >>> Interested in any of your thoughts.
>> >>>
>> >>> Cheers,
>> >>> Mark
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>> >>
>> >> --------------------------
>> >> Grant Ingersoll
>> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by Mark Harwood <ma...@yahoo.co.uk>.
I was using within-segment doc ids stored in link files named after both the source and target segments (a link after all is 2 endpoints).
For a complete solution you ultimately have to deal with the fact that doc ids could be references to:
* Stable, committed docs (the easy case)
* Flushed but not yet committed docs
* Buffered but not yet flushed docs
* Flushed/committed but currently merging docs

...all of which are happening in different threads e.g reader has one view of the world, a background thread is busy merging segments to create a new view of the world even after commits have completed.

All very messy.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 08 November 2010 20:52:53 mark harwood wrote:
> I came to the conclusion that the transient meaning of document ids is too 
> deeply ingrained in Lucene's design to use them to underpin any reliable 
> linking.
> While it might work for relatively static indexes, any index with a reasonable 
> number of updates or deletes will invalidate any stored document references in 
> ways which are very hard to track. Lucene's compaction shuffles IDs without 
> taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling 
> IDs" here: http://goo.gl/5UbJi )

Did you try to keep the docId's as segmentId-inSegmentDocId in a tree?

Somehow I think this could work, because this would not really complicate
adding/deleting relations between docId's and the segment merges would
become massive but straightforward deletes and inserts, and with some
luck the amount of work for that would be a small proportion of the work
for a normal segment merge.

Regards,
Paul Elschot

> 
> 
> Cheers,
> Mark
> 
> 
> ----- Original Message ----
> From: Ryan McKinley <ry...@gmail.com>
> To: dev@lucene.apache.org
> Sent: Mon, 8 November, 2010 19:03:59
> Subject: Re: Document links
> 
> Any updates/progress with this?
> 
> I'm looking at ways to implement an RTree with lucene -- and this
> discussion seems relevant
> 
> thanks
> ryan
> 
> 
> On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <ma...@yahoo.co.uk> wrote:
> >>>Both these on disk data structures and the ones in a B+ tree have seek 
> offsets
> >>>into files
> >>>that require disk seeks. And both could use document ids as key values.
> >
> > Yep. However my approach doesn't use a doc id as a key that is searched in any
> > B+ tree index (which involves disk seeks) - it is used as direct offset into a
> > file to get the pointer into a "links" data structure.
> >
> >
> >
> >>>But do these disk data structures support dynamic addition and deletion of
> >>>(larger
> >>>numbers of) document links?
> >
> > Yes, the slide deck I linked to shows how links (like documents) spend the 
> >early
> > stages of life being merged frequently in the smaller, newer segments and over
> > time migrate into larger, more stable segments as part of Lucene transactions.
> >
> > That's the theory - I'm currently benchmarking an early prototype.
> >
> >
> >
> > ----- Original Message ----
> > From: Paul Elschot <pa...@xs4all.nl>
> > To: dev@lucene.apache.org
> > Sent: Sat, 25 September, 2010 22:03:28
> > Subject: Re: Document links
> >
> > Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
> >> My starting point in the solution I propose was to eliminate linking via any
> >>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
> >>traversals have exponential numbers of links and so all these index disk seeks
> >>start to stack up. The solution I propose uses doc ids as more-or-less direct
> >>pointers into file structures avoiding any index lookup.
> >> I've started coding up some tests using the file structures I outlined and 
> >will
> >>compare that with a traditional key-based approach.
> >
> > Both these on disk data structures and the ones in a B+ tree have seek offsets
> > into files
> > that require disk seeks. And both could use document ids as key values.
> >
> > But do these disk data structures support dynamic addition and deletion of
> > (larger
> > numbers of) document links?
> >
> > B+ trees are a standard solution for problems like this one, and it would
> > probably
> > not be easy to outperform them.
> > It may be possible to improve performance of B+ trees somewhat by specializing
> > for the fairly simple keys that would be needed, and by encoding very short
> > lists of links
> > for a single document directly into a seek offset to avoid the actual seek, 
> but
> > that's
> > about it.
> >
> > Regards,
> > Paul Elschot
> >
> >>
> >> For reference - playing the "Kevin Bacon game" on a traditional Lucene index 
> >of
> >>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
> >>milliseconds on the same data (and this was a disk based graph of 3m nodes, 
> 10m
> >>edges).
> >> Going from actor->movies->actors->movies produces a lot of key lookups and 
> the
> >>difference between key indexes and direct node pointers becomes clear.
> >> I know path finding analysis is perhaps not a typical Lucene application but
> >>other forms of link analysis e.g. recommendation engines require similar
> >>performance.
> >>
> >> Cheers
> >> Mark
> >>
> >>
> >>
> >> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
> >>
> >> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
> >> >>>> While not exactly equivalent, it reminds me of our earlier discussion
> >>around
> >>
> >> >>>> "layered segments" for dealing with field updates
> >> >>
> >> >> Right. Fast discovery of document relations is a foundation on which lots 
> >of
> >>
> >> >> things like this can build. Relations can be given types to support a 
> >number
> >>of
> >>
> >> >> different use cases.
> >> >
> >> > How about using this (bsd licenced) tree as a starting point:
> >> > http://bplusdotnet.sourceforge.net/
> >> > It has various keys: ao. byte array, String and long.
> >> >
> >> > A fixed size byte array as key seems to be just fine: two bytes for a field
> >>number,
> >> > four for the segment number and four for the in-segment document id.
> >> > The separate segment number would allow to minimize the updates
> >> > in the tree during merges. One could also use the normal doc id directly.
> >> >
> >> > The value could then be a similar to the key, but without
> >> > the field number, and with an indication of the direction of the link.
> >> > Or perhaps the direction of the link should be added to the key.
> >> > A link would be present twice, once for each direction.
> >> > Also both directions could have their own payloads.
> >> >
> >> > It could be put in its own file as a separate 'segment', or maybe
> >> > each segment could allow for allocation of a part of this tree.
> >> >
> >> > I like this somehow, in case it is done right one might never
> >> > need a relational database again. Well, almost...
> >> >
> >> > Regards,
> >> > Paul Elschot
> >> >
> >> >
> >> >>
> >> >>
> >> >>
> >> >> ----- Original Message ----
> >> >> From: Grant Ingersoll <gs...@apache.org>
> >> >> To: dev@lucene.apache.org
> >> >> Sent: Fri, 24 September, 2010 16:26:27
> >> >> Subject: Re: Document links
> >> >>
> >> >> While not exactly equivalent, it reminds me of our earlier discussion 
> >around
> >>
> >> >> "layered segments" for dealing with field updates [1], [2], albeit this is 
> >a
> >>bit
> >>
> >> >> more generic since one could not only use the links for relating 
> documents,
> >>but
> >>
> >> >> one could use "special" links underneath the covers in Lucene to
> >>maintain/mark
> >>
> >> >> which fields have been updated and then traverse to them.
> >> >>
> >> >> [1]
> >> >>
> >>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
> >>
> >>
> >> >>
> >> >> [2]
> >> >>
> >>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
> >>
> >>
> >> >>
> >> >>
> >> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
> >> >>
> >> >>> This slideshow has a first-cut on the Lucene file format extensions
> >>required to
> >>
> >> >>>
> >> >>> support fast linking between documents:
> >> >>>
> >> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
> >> >>>
> >> >>>
> >> >>> Interested in any of your thoughts.
> >> >>>
> >> >>> Cheers,
> >> >>> Mark
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>>
> >> >>
> >> >> --------------------------
> >> >> Grant Ingersoll
> >> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >>
> >> >>
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: dev-help@lucene.apache.org
> >> >
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 
>       
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 08 November 2010 21:34:18 Ryan McKinley wrote:
> On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot <pa...@xs4all.nl> wrote:
> > On Monday 08 November 2010 20:03:59 Ryan McKinley wrote:
> >> Any updates/progress with this?
> >>
> >> I'm looking at ways to implement an RTree with lucene -- and this
> >> discussion seems relevant
> >
> > Did you consider merging the bits of each dimension into a NumericField?
> >
> > For example: one dimension a0 a1 .. an
> > and a second dimension b0 b1 .. bn
> > into a0 b0 a1 b1 .. an bn
> > and then index this number as a NumericField.
> >
> 
> Something like the geohash algorithm but with n dimensions?

Yes. It is also a simple bounded volume hierarchy.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by Ryan McKinley <ry...@gmail.com>.
On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot <pa...@xs4all.nl> wrote:
> On Monday 08 November 2010 20:03:59 Ryan McKinley wrote:
>> Any updates/progress with this?
>>
>> I'm looking at ways to implement an RTree with lucene -- and this
>> discussion seems relevant
>
> Did you consider merging the bits of each dimension into a NumericField?
>
> For example: one dimension a0 a1 .. an
> and a second dimension b0 b1 .. bn
> into a0 b0 a1 b1 .. an bn
> and then index this number as a NumericField.
>

Something like the geohash algorithm but with n dimensions?

The linking work that Mark discussed seems nice since it would give
faster access to navigating the tree -- finding N nearest neigbhors
etc...

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 08 November 2010 20:03:59 Ryan McKinley wrote:
> Any updates/progress with this?
> 
> I'm looking at ways to implement an RTree with lucene -- and this
> discussion seems relevant

Did you consider merging the bits of each dimension into a NumericField?

For example: one dimension a0 a1 .. an
and a second dimension b0 b1 .. bn
into a0 b0 a1 b1 .. an bn
and then index this number as a NumericField.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by Ryan McKinley <ry...@gmail.com>.
On Mon, Nov 8, 2010 at 2:52 PM, mark harwood <ma...@yahoo.co.uk> wrote:
> I came to the conclusion that the transient meaning of document ids is too
> deeply ingrained in Lucene's design to use them to underpin any reliable
> linking.

What about if we define an id field (like in solr)?

Whatever does the traversal would need to make a Map<id,docID>, but
that is still better then then needing to do a query for each link.


> While it might work for relatively static indexes, any index with a reasonable
> number of updates or deletes will invalidate any stored document references in
> ways which are very hard to track. Lucene's compaction shuffles IDs without
> taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling
> IDs" here: http://goo.gl/5UbJi )
>

oh ya -- and it is even more akward since each subreader often reuses
the same docId

ryan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Document links

Posted by mark harwood <ma...@yahoo.co.uk>.
I came to the conclusion that the transient meaning of document ids is too 
deeply ingrained in Lucene's design to use them to underpin any reliable 
linking.
While it might work for relatively static indexes, any index with a reasonable 
number of updates or deletes will invalidate any stored document references in 
ways which are very hard to track. Lucene's compaction shuffles IDs without 
taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling 
IDs" here: http://goo.gl/5UbJi )


Cheers,
Mark


----- Original Message ----
From: Ryan McKinley <ry...@gmail.com>
To: dev@lucene.apache.org
Sent: Mon, 8 November, 2010 19:03:59
Subject: Re: Document links

Any updates/progress with this?

I'm looking at ways to implement an RTree with lucene -- and this
discussion seems relevant

thanks
ryan


On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <ma...@yahoo.co.uk> wrote:
>>>Both these on disk data structures and the ones in a B+ tree have seek 
offsets
>>>into files
>>>that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>>But do these disk data structures support dynamic addition and deletion of
>>>(larger
>>>numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the 
>early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> ----- Original Message ----
> From: Paul Elschot <pa...@xs4all.nl>
> To: dev@lucene.apache.org
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>>traversals have exponential numbers of links and so all these index disk seeks
>>start to stack up. The solution I propose uses doc ids as more-or-less direct
>>pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and 
>will
>>compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, 
but
> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index 
>of
>>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>>milliseconds on the same data (and this was a disk based graph of 3m nodes, 
10m
>>edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and 
the
>>difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>>other forms of link analysis e.g. recommendation engines require similar
>>performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>>around
>>
>> >>>> "layered segments" for dealing with field updates
>> >>
>> >> Right. Fast discovery of document relations is a foundation on which lots 
>of
>>
>> >> things like this can build. Relations can be given types to support a 
>number
>>of
>>
>> >> different use cases.
>> >
>> > How about using this (bsd licenced) tree as a starting point:
>> > http://bplusdotnet.sourceforge.net/
>> > It has various keys: ao. byte array, String and long.
>> >
>> > A fixed size byte array as key seems to be just fine: two bytes for a field
>>number,
>> > four for the segment number and four for the in-segment document id.
>> > The separate segment number would allow to minimize the updates
>> > in the tree during merges. One could also use the normal doc id directly.
>> >
>> > The value could then be a similar to the key, but without
>> > the field number, and with an indication of the direction of the link.
>> > Or perhaps the direction of the link should be added to the key.
>> > A link would be present twice, once for each direction.
>> > Also both directions could have their own payloads.
>> >
>> > It could be put in its own file as a separate 'segment', or maybe
>> > each segment could allow for allocation of a part of this tree.
>> >
>> > I like this somehow, in case it is done right one might never
>> > need a relational database again. Well, almost...
>> >
>> > Regards,
>> > Paul Elschot
>> >
>> >
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> From: Grant Ingersoll <gs...@apache.org>
>> >> To: dev@lucene.apache.org
>> >> Sent: Fri, 24 September, 2010 16:26:27
>> >> Subject: Re: Document links
>> >>
>> >> While not exactly equivalent, it reminds me of our earlier discussion 
>around
>>
>> >> "layered segments" for dealing with field updates [1], [2], albeit this is 
>a
>>bit
>>
>> >> more generic since one could not only use the links for relating 
documents,
>>but
>>
>> >> one could use "special" links underneath the covers in Lucene to
>>maintain/mark
>>
>> >> which fields have been updated and then traverse to them.
>> >>
>> >> [1]
>> >>
>>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>>
>> >>
>> >> [2]
>> >>
>>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>>
>> >>
>> >>
>> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> >>
>> >>> This slideshow has a first-cut on the Lucene file format extensions
>>required to
>>
>> >>>
>> >>> support fast linking between documents:
>> >>>
>> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>> >>>
>> >>>
>> >>> Interested in any of your thoughts.
>> >>>
>> >>> Cheers,
>> >>> Mark
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>> >>
>> >> --------------------------
>> >> Grant Ingersoll
>> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


      

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org