Posted to dev@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2010/03/16 04:57:28 UTC

ProximityQuery

I'd like to offer a proximity query type in my app, so that I can search like:

 foo NEAR10 bar

to find all instances of 'foo' within 10 token positions of 'bar'.[0]

It seems like the place to start, if I were to take the route of
subclassing/extending an existing class, is the PhraseQuery feature,
specifically the PhraseScorer and the internal winnow_anchors() function. Am I
on the right track here?

[0] I believe Lucene syntax for that query is "foo bar"~10

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/20/10 2:09 PM:

> Calc_Phrase_Freq() is documented in PhraseScorer.bp.  I thought about adding
> additional explanatory comments to PhraseScorer.c, but I think that might be
> overkill.  If you had seen this, would it have been enough?
> 
>     /** Calculate how often the phrase occurs in the current document.
>      */
>     float
>     Calc_Phrase_Freq(PhraseScorer *self);
> 

yes, it would have been enough. There is a convention in C of putting the
comments/documentation for functions in the .h file, which has an analog in the
.bp files in Lucy/KS. Mostly this just means needing to look in 2 places
more-or-less simultaneously, since the .bp files (like .h) mostly just hold the
signatures. It's fine; it's just something I have to remember since my
day-to-day is with languages that don't have separate .h/.bp and .c files.


>> All those GOTO calls are indeed "non-standard form" (wink wink, nudge nudge)
>> and were what sparked my initial question to the list.
> 
> I thought about commenting that section, but I couldn't really improve on
> Nate's code.  It's self-documenting and impressively compact.  The larger
> algorithm is described in the multi-line comment above.  So maybe leave it as
> is for now?

it's fine to leave it as it is. The terseness of the code just meant that the
late hour at which I first started looking at this was not a good time for
full-brain functioning. ;)

> parcel Lucy;
> 
> /** Quick-start guide to hacking on Lucy.

very clear and helpful. Wish I had it months ago. :) Now I understand the
function-vs-method capitalization.

thanks, Marvin.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 19, 2010 at 10:08:21PM -0500, Peter Karman wrote:
> Aesthetic comment: I like C++ style comments for legibility and
> speed-of-writing.

Agreed, they are definitely superior for line comments.

> The longer block of comments around PhraseScorer_calc_phrase_freq is
> helpful.  The concept of "phrase frequency" had not scored well on my
> grok-o-meter.

Calc_Phrase_Freq() is documented in PhraseScorer.bp.  I thought about adding
additional explanatory comments to PhraseScorer.c, but I think that might be
overkill.  If you had seen this, would it have been enough?

    /** Calculate how often the phrase occurs in the current document.
     */
    float
    Calc_Phrase_Freq(PhraseScorer *self);

> The capitalization of function names confuses me (not specific to this
> revision). I see PhraseScorer_Calc_Phrase_Freq and
> PhraseScorer_calc_phrase_freq. I know intuitively that somehow that
> convention must be internally consistent with the magic of Clownfish, etc.,
> so I'm guessing I just haven't yet come across where the difference in case
> is documented.

OK, this was good to know.  The capitalization distinction is actually very
important, and the fact that you're this deeply involved and weren't 100%
clear means that we definitely need to take action to improve the situation. 

Rather than explain things in an email reply here (which fresh Lucy hackers
wouldn't see), I've written up a proposed Lucy::Docs::DevGuide starter.  It's
below my sig.

> All those GOTO calls are indeed "non-standard form" (wink wink, nudge nudge)
> and were what sparked my initial question to the list.

I thought about commenting that section, but I couldn't really improve on
Nate's code.  It's self-documenting and impressively compact.  The larger
algorithm is described in the multi-line comment above.  So maybe leave it as
is for now?

Marvin Humphrey


parcel Lucy;

/** Quick-start guide to hacking on Lucy.
 *
 * The Lucy code base is organized into roughly four layers:
 *  
 *    * Charmonizer - compiler and OS configuration probing.
 *    * Clownfish - header files.
 *    * C - implementation files.
 *    * Host - binding language.
 * 
 * Charmonizer is a configuration prober which writes a single header file,
 * "charmony.h", describing the build environment and facilitating
 * cross-platform development.  It's similar to Autoconf or Metaconfig, but
 * written in pure C.
 *
 * The ".cfh" (or historically, ".bp") files within the Lucy core are
 * Clownfish header files.  Clownfish is a purpose-built, declaration-only
 * language which superimposes a single-inheritance object model on top of C
 * which is specifically designed to co-exist happily with a variety of "host"
 * languages and to allow limited run-time dynamic subclassing.  For more
 * information see the Clownfish docs, but if there's one thing you should
 * know about Clownfish OO before you start hacking, it's that method calls
 * are differentiated from functions by capitalization:
 *
 *     Indexer_Add_Doc   <-- Method, typically uses dynamic dispatch.
 *     Indexer_add_doc   <-- Function, always a direct invocation.
 * 
 * The C files within the Lucy core are where most of Lucy's low-level
 * functionality lies.  They implement the interface defined by the Clownfish
 * header files.
 *
 * The C core is intentionally left incomplete, however; to be usable, it must
 * be bound to a "host" language.  (In this context, even C is considered a
 * "host" which must implement the missing pieces and be "bound" to the core.)
 * Some of the binding code is autogenerated by Clownfish on a spec customized
 * for each language.  Other pieces are hand-coded in either C (using the
 * host's C API) or the host language itself.
 */

inert class Lucy::Docs::DevGuide { }
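
To make the capitalization rule in the guide concrete, here is a minimal sketch in plain C. Only the names Indexer_Add_Doc and Indexer_add_doc come from the guide above; the struct layout, the function pointer field, and LoudIndexer_add_doc are hypothetical and are not what Clownfish actually generates. The capitalized name looks up the implementation at run time, while the lowercase name is an ordinary direct call.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Indexer Indexer;
typedef int (*Indexer_add_doc_t)(Indexer *self, int doc_id);

// Hypothetical object layout: one slot standing in for a method table.
struct Indexer {
    Indexer_add_doc_t add_doc;  // implementation chosen per object
    int num_docs;
    int num_loud_calls;
};

// Function (lowercase): always a direct invocation of this implementation.
int
Indexer_add_doc(Indexer *self, int doc_id) {
    assert(self != NULL);
    (void)doc_id;
    return ++self->num_docs;
}

// A subclass-style override, also a plain function.
int
LoudIndexer_add_doc(Indexer *self, int doc_id) {
    self->num_loud_calls++;
    return Indexer_add_doc(self, doc_id);
}

// Method (capitalized): dispatches through the object's slot, so a
// subclass override is picked up at run time.
static inline int
Indexer_Add_Doc(Indexer *self, int doc_id) {
    return self->add_doc(self, doc_id);
}
```

Calling Indexer_Add_Doc on an object whose slot holds LoudIndexer_add_doc runs the override; calling Indexer_add_doc on the same object bypasses it.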


Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/22/10 1:10 AM:

> 
> We should zap my TODO test.
> 

ok.


>> I think simpler is better here: if you want order to not matter, then OR
>> together the various orders you might be interested in. In fact, I may offer
>> that as an option in the Search::Query::Parser, which could then do the ORing
>> programmatically. Likewise, if we choose to support the "a b"~N syntax in the KS
>> QueryParser, could do something similar.
> 
> I'd rather shunt people who need more than the basic syntax of the core
> QueryParser towards yours than try to imitate it.  :)

heh. fair enough. :)

I've added prelim support to Search::Query::Dialect::KSx and
Search::Query::Parser in svn. I'm not sure yet how to offer the optional "ignore
order" feature. Maybe an 'ignore_order_in_proximity' flag. I'll have to think
about if that should also affect serialization. It would need to be in
Dialect::KSx, since the other dialects I'm currently supporting do not have that
kind of feature.

> 
>>> Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
>>> but "proximity" is better) instead of "near" for the name of that parameter.
>>> Or alternately, "slop", but I understand why you went with nearness instead.
>>
>> I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
>> How about 'within'? Or 'vicinity'?
> 
> Those all seem fine to me.  I'd cast my vote for "proximity" just because you
> chose to call the class "ProximityQuery" and an exact name match seems easiest
> to remember, but "within" is a little easier to spell and just has a slightly
> more "natural language" linguistic emphasis as opposed to more traditional
> "noun = value" naming style.

I like 'within' -- easy (enough) to remember and type.

changes pushed in r5942.

thanks for the thorough review, Marvin.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 21, 2010 at 08:50:10PM -0500, Peter Karman wrote:
> > The current implementation has a limitation I think is probably pretty
> > important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.
> 
> As you noted earlier in this thread, there is no consensus about what a
> proximity query is. :)

Touché!

> I did consider the fact that proximity might imply that order does not matter.
> But I came down here: if I want order to matter, and the ProximityScorer ignores
> order as you're suggesting, then I have no options. I can't limit my search to
> 'a NEAR b'.
> 
> If instead we leave the ProximityScorer as is, then this:
> 
>  (a NEAR b) OR (b NEAR a)
> 
> does what you're describing.

Truth.

Foolish me didn't realize it had been a conscious choice.

> Consider too:
> 
>  (a NEAR b NEAR c)
> 
> which might be written as:
> 
>  "a b c"~10
> 
> What order should I consider there? 'a' within 10 positions of 'b' and 'c'? or
> 'b' within 10 positions of 'a' and 'c'? or... You see how the possibilities
> multiply.

OK, I can see how the more limited semantics that you chose for ProximityQuery
are actually liberating under many circumstances.

We should zap my TODO test.

> I think simpler is better here: if you want order to not matter, then OR
> together the various orders you might be interested in. In fact, I may offer
> that as an option in the Search::Query::Parser, which could then do the ORing
> programmatically. Likewise, if we choose to support the "a b"~N syntax in the KS
> QueryParser, could do something similar.

I'd rather shunt people who need more than the basic syntax of the core
QueryParser towards yours than try to imitate it.  :)

> > Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
> > but "proximity" is better) instead of "near" for the name of that parameter.
> > Or alternately, "slop", but I understand why you went with nearness instead.
> 
> I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
> How about 'within'? Or 'vicinity'?

Those all seem fine to me.  I'd cast my vote for "proximity" just because you
chose to call the class "ProximityQuery" and an exact name match seems easiest
to remember, but "within" is a little easier to spell and just has a slightly
more "natural language" linguistic emphasis as opposed to more traditional
"noun = value" naming style.

Marvin Humphrey


Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/21/10 3:07 PM:
> On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
>> Marvin, please have a look when you have a chance, and let me know what needs
>> changing.
> 
> The current implementation has a limitation I think is probably pretty
> important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.
> 

As you noted earlier in this thread, there is no consensus about what a
proximity query is. :)

I did consider the fact that proximity might imply that order does not matter.
But I came down here: if I want order to matter, and the ProximityScorer ignores
order as you're suggesting, then I have no options. I can't limit my search to
'a NEAR b'.

If instead we leave the ProximityScorer as is, then this:

 (a NEAR b) OR (b NEAR a)

does what you're describing.
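
A minimal sketch of that composition, assuming each term's token positions within one document are available as an int32_t array. The helper names near_ordered and near_any_order are hypothetical, and reading 'a NEAR b' as "b follows a by 1 to `within` positions" is an assumption made for illustration, not the ProximityScorer's documented behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Ordered check: true if some position in `second` follows a position in
// `first` by between 1 and `within` tokens.
static bool
near_ordered(const int32_t *first, size_t num_first,
             const int32_t *second, size_t num_second, int32_t within) {
    assert(within >= 1);
    for (size_t i = 0; i < num_first; i++) {
        for (size_t j = 0; j < num_second; j++) {
            int32_t gap = second[j] - first[i];
            if (gap >= 1 && gap <= within) { return true; }
        }
    }
    return false;
}

// (a NEAR b) OR (b NEAR a): order stops mattering once both are tried.
static bool
near_any_order(const int32_t *a, size_t num_a,
               const int32_t *b, size_t num_b, int32_t within) {
    return near_ordered(a, num_a, b, num_b, within)
        || near_ordered(b, num_b, a, num_a, within);
}
```

The ordered helper stays available on its own, which is the point of the argument above: a caller who cares about order can still ask for it.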

Consider too:

 (a NEAR b NEAR c)

which might be written as:

 "a b c"~10

What order should I consider there? 'a' within 10 positions of 'b' and 'c'? or
'b' within 10 positions of 'a' and 'c'? or... You see how the possibilities
multiply.

I think simpler is better here: if you want order to not matter, then OR
together the various orders you might be interested in. In fact, I may offer
that as an option in the Search::Query::Parser, which could then do the ORing
programmatically. Likewise, if we choose to support the "a b"~N syntax in the KS
QueryParser, could do something similar.

I note that one of the Lucene classes you mentioned earlier[0] makes inOrder an
option. The Lucene PhraseScorer's slop feature, however, does seem to respect
order with no option otherwise.

[0]
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/spans/SpanNearQuery.java



> 
> Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
> but "proximity" is better) instead of "near" for the name of that parameter.
> Or alternately, "slop", but I understand why you went with nearness instead.

I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
How about 'within'? Or 'vicinity'?

> 
>> In the end it was a one-line difference in the SI_winnow_anchors implementation
>> to get the near/slop feature working. I left the original implementation intact
>> and put a switch/case wrapper around it to leave the optimization (if any)
>> intact for phrases (near==1).
> 
> This doesn't technically need changing, but to cut down on the duplicated
> code, the switch on self->near should theoretically happen here:

ah yes, that's much better.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
> Marvin, please have a look when you have a chance, and let me know what needs
> changing.

The current implementation has a limitation I think is probably pretty
important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.

I committed a TODO test illustrating the problem.  I think to fix it you're
going to need to...

   * Change the arrays from unsigned to signed so that subtraction doesn't
     produce any surprises.
   * Work with absolute values, so that (2 - 3) and (3 - 2) both produce a
     proximity of 1.
   * Rework both SPIN_CANDIDATES and SPIN_ANCHORS so that they bring the 
     elements in question "within range" instead of testing for ">=".  In
     other words, if "near" is 1 and the current anchor is 5, SPIN_CANDIDATES
     should stop at 4 rather than keep going until the candidate is at least 5
     as it does now.

It's going to be tricky business to get right.  You'll probably need a battery
of tests to cover the edge cases.
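
The three points above can be sketched as follows. This is an illustration under stated assumptions, not the ProximityScorer code: positions are held in sorted int32_t arrays, and the names proximity and spin_candidates are hypothetical stand-ins (the real SPIN_CANDIDATES is a label inside the scorer, not a function).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Signed arithmetic plus an absolute value, so that (2 - 3) and (3 - 2)
// both give a proximity of 1.  With unsigned 32-bit values, 2 - 3 would
// wrap around to 4294967295.
static int32_t
proximity(int32_t pos_a, int32_t pos_b) {
    int32_t diff = pos_a - pos_b;
    return diff < 0 ? -diff : diff;
}

// Advance from `start` to the first candidate that is "within range" of
// the anchor on the low side, i.e. not more than `near` positions before
// it, instead of spinning until candidate >= anchor.
static size_t
spin_candidates(const int32_t *candidates, size_t num_candidates,
                size_t start, int32_t anchor, int32_t near) {
    assert(near >= 0);
    size_t i = start;
    while (i < num_candidates && candidates[i] < anchor - near) { i++; }
    return i;
}
```

With near of 1 and an anchor of 5, candidates {1, 4, 5} stop at index 1 (the candidate at position 4), matching the example in the third bullet.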

Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
but "proximity" is better) instead of "near" for the name of that parameter.
Or alternately, "slop", but I understand why you went with nearness instead.

> In the end it was a one-line difference in the SI_winnow_anchors implementation
> to get the near/slop feature working. I left the original implementation intact
> and put a switch/case wrapper around it to leave the optimization (if any)
> intact for phrases (near==1).

This doesn't technically need changing, but to cut down on the duplicated
code, the switch on self->near should theoretically happen here:

--- ../core/KSx/Search/ProximityScorer.c    (revision 5936)
+++ ../core/KSx/Search/ProximityScorer.c    (working copy)
@@ -352,8 +352,14 @@
         
         // Splice out anchors that don't match the next term.  Bail out if
         // we've eliminated all possible anchors.
-        anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
-            candidates_start, candidates_end, i, self->near);
+        if (self->near == 1) { // optimized case
+            anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
+                candidates_start, candidates_end, i, 1);
+        }
+        else { // punt case
+            anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
+                candidates_start, candidates_end, i, self->near);
+        }
         if (!anchors_remaining) { return 0.0f; }
 
         // Adjust end for number of anchors that remain. 


Note the compile-time constant "1" being passed to the static inline function
SI_winnow_anchors.  That's sufficient for an optimizing compiler to look into
the body of SI_winnow_anchors and replace all instances of the variable with
the compile-time constant, potentially finding optimizations for that one
inline use.  At least in theory, it's not necessary to create
SI_winnow_anchors1 and SI_winnow_anchorsN.

It's difficult to verify that the compiler exploited the intended
optimization, though, because you need to look at the assembler.  Hard to
write a test case for that. 
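
A self-contained sketch of the same technique, for readers who want to try it outside the scorer. The function winnow below is a simplified, hypothetical stand-in for SI_winnow_anchors (it assumes sorted uint32_t arrays and keeps an anchor when the next candidate at or after it is fewer than `near` positions away); only the shape of the call site mirrors the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Simplified stand-in: filter `anchors` in place, returning how many remain.
static inline size_t
winnow(uint32_t *anchors, size_t num_anchors,
       const uint32_t *candidates, size_t num_candidates, uint32_t near) {
    size_t kept = 0;
    size_t c = 0;
    for (size_t a = 0; a < num_anchors; a++) {
        // Skip candidates that fall before this anchor.
        while (c < num_candidates && candidates[c] < anchors[a]) { c++; }
        if (c == num_candidates) { break; }
        // candidates[c] >= anchors[a] here, so the subtraction cannot wrap.
        if (candidates[c] - anchors[a] < near) {
            anchors[kept++] = anchors[a];
        }
    }
    return kept;
}

size_t
winnow_dispatch(uint32_t *anchors, size_t num_anchors,
                const uint32_t *candidates, size_t num_candidates,
                uint32_t near) {
    assert(near >= 1);
    if (near == 1) { // constant argument: the compiler may specialize
        return winnow(anchors, num_anchors, candidates, num_candidates, 1);
    }
    else {           // general case
        return winnow(anchors, num_anchors, candidates, num_candidates,
                      near);
    }
}
```

Both branches produce the same results for near == 1; whether the first branch is actually compiled to tighter code depends on the compiler and optimization level, which is why inspecting the generated assembler is the only real check.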

Marvin Humphrey


Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 3/19/10 10:08 PM:

> 
> I'm going to dive into the Proximity classes now and see if I can break them.
> 

r5936 implements ProximityQuery for KS.

Marvin, please have a look when you have a chance, and let me know what needs
changing.

In the end it was a one-line difference in the SI_winnow_anchors implementation
to get the near/slop feature working. I left the original implementation intact
and put a switch/case wrapper around it to leave the optimization (if any)
intact for phrases (near==1).

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/19/10 2:35 PM:

> If you can find the time, I'd find a brainlog helpful to see whether it was enough,
> too much, properly focused, etc.

Aesthetic comment: I like C++ style comments for legibility and speed-of-writing.

The clarification in the one- and two-line comments is helpful.

The longer block of comments around PhraseScorer_calc_phrase_freq is helpful.
The concept of "phrase frequency" had not scored well on my grok-o-meter.

The capitalization of function names confuses me (not specific to this
revision). I see PhraseScorer_Calc_Phrase_Freq and
PhraseScorer_calc_phrase_freq. I know intuitively that somehow that convention
must be internally consistent with the magic of Clownfish, etc., so I'm guessing
I just haven't yet come across where the difference in case is documented.

All those GOTO calls are indeed "non-standard form" (wink wink, nudge nudge) and
were what sparked my initial question to the list.

I'm going to dive into the Proximity classes now and see if I can break them.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 19, 2010 at 11:46:30AM -0500, Peter Karman wrote:
> > As a first step, how about I sweep through and expand the comments in
> > PhraseScorer?
> > 
> > It would be similar to what I'd say on the mailing list explaining the
> > algorithm, only we'll actually make progress on the code. :)
> 
> sounds good.

OK, I've finished.

If you can find the time, I'd find a brainlog helpful to see whether it was enough,
too much, properly focused, etc.

> you'll notice I checked in stubs for the new Proximity stuff last night
> to trunk. I wasn't sure if there was a way to re-use some of the Phrase
> classes, so I just created a parallel structure, including test file,
> for the Proximity Scorer/Query classes.

Yes, it should be separate, since we'd like to spin it off once there's a
workable C API.

Marvin Humphrey


Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 03/19/2010 11:06 AM:
> On Thu, Mar 18, 2010 at 10:04:49PM -0500, Peter Karman wrote:
>>> So, I think what we should do is clean up PhraseScorer so that it is clearer,
>>> then create ProximityScorer by cloning and modding it.  It's a mild violation
>>> of DRY, but that doesn't bother me.  All of us will benefit from the cleanup,
>>> and you'll walk away with a thorough understanding of the algorithm and a
>>> top-flight ProximityScorer.
>> ok. I'll go this route, creating a new KSx::Search::ProximityScorer. I'll send
>> some patches along when I have something useful.
> 
> As a first step, how about I sweep through and expand the comments in
> PhraseScorer?
> 
> It would be similar to what I'd say on the mailing list explaining the
> algorithm, only we'll actually make progress on the code. :)

sounds good.

you'll notice I checked in stubs for the new Proximity stuff last night
to trunk. I wasn't sure if there was a way to re-use some of the Phrase
classes, so I just created a parallel structure, including test file,
for the Proximity Scorer/Query classes.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Mar 18, 2010 at 10:04:49PM -0500, Peter Karman wrote:
> > So, I think what we should do is clean up PhraseScorer so that it is clearer,
> > then create ProximityScorer by cloning and modding it.  It's a mild violation
> > of DRY, but that doesn't bother me.  All of us will benefit from the cleanup,
> > and you'll walk away with a thorough understanding of the algorithm and a
> > top-flight ProximityScorer.
> 
> ok. I'll go this route, creating a new KSx::Search::ProximityScorer. I'll send
> some patches along when I have something useful.

As a first step, how about I sweep through and expand the comments in
PhraseScorer?

It would be similar to what I'd say on the mailing list explaining the
algorithm, only we'll actually make progress on the code. :)

Marvin Humphrey


Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/17/10 11:04 AM:

> So, I think what we should do is clean up PhraseScorer so that it is clearer,
> then create ProximityScorer by cloning and modding it.  It's a mild violation
> of DRY, but that doesn't bother me.  All of us will benefit from the cleanup,
> and you'll walk away with a thorough understanding of the algorithm and a
> top-flight ProximityScorer.

ok. I'll go this route, creating a new KSx::Search::ProximityScorer. I'll send
some patches along when I have something useful.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 16, 2010 at 10:14:32PM -0500, Peter Karman wrote:
> > Within the existing KS code base, PhraseScorer would be the closest thing
> > to what you want.  It wasn't really built to handle nearness, but maybe it
> > can be adapted.
> 
> My (perhaps naive) assumption was that a PhraseScorer isa ProximityScorer
> where proximity==1.

The present implementation of PhraseScorer is not generalized for variable
proximity.  It looks for exact matches...

    got == wanted 

... rather than checking for slop:

    abs(got - wanted) < slop
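
Spelled out as two small C predicates, the difference looks like this. The function names are hypothetical; the comparison in sloppy_match uses the strict "<" exactly as written above, which means a slop of 1 accepts only got == wanted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Exact phrase matching: the position found must equal the one expected.
static bool
exact_match(int32_t got, int32_t wanted) {
    return got == wanted;
}

// Sloppy matching: accept a position that is fewer than `slop` positions
// away from the expected one, in either direction.
static bool
sloppy_match(int32_t got, int32_t wanted, int32_t slop) {
    assert(slop >= 0);
    int32_t diff = got - wanted;
    if (diff < 0) { diff = -diff; }
    return diff < slop;
}
```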

I think it's possible to mod the position-matching algorithm without affecting
performance.[1]  However, I'm concerned about two things.  

First, the code is apparently not clear enough today for you to understand it
just by spelunking -- despite your substantial expertise, it was necessary to
ask on the list.  That tells me we shouldn't be adding to it but rather
refactoring it for simplicity and clarity first.

(We don't need to worry about optimizing the matching algorithm further.  It
was fast when I finished it, and then Nate went to town and streamlined it
further.  So refactoring should focus on superficial organization and
comments.)

Second, everyone understands what constitutes an exact phrase match, but
there's no consensus about what constitutes a sloppy phrase match.  I think
the core PhraseScorer should stay focused on canonical phrase matching rather
than branch out.

So, I think what we should do is clean up PhraseScorer so that it is clearer,
then create ProximityScorer by cloning and modding it.  It's a mild violation
of DRY, but that doesn't bother me.  All of us will benefit from the cleanup,
and you'll walk away with a thorough understanding of the algorithm and a
top-flight ProximityScorer.

> > Do you have an idea yet as to how you might publish this?
> 
> I need to understand how the phrase matching is done currently (see above).
> If I could contribute it to the KS core, I'd be happy to. Otherwise, I
> imagine adding it to Search::Query::Dialect::KSx as another *Query type,
> joining the Wildcard features.

I think the core should be limited to canonical query types, and that
therefore ProximityQuery should be implemented as an extension.  In the
interest of time and convenience, though, we should probably treat it the same
way that KSx::Search::Filter is treated today, and build it into the main
distro, just under KSx.  Once we have a decent C API, we should seek to spin
it off.

Alternately, you could write a pure-Perl implementation, but then PhraseScorer
wouldn't get a housecleaning, and it would actually be a PITA for you to port
all that C code rather than dupe it and make limited modifications --
especially if not all the necessary information is available at the Perl level
(which it probably isn't.)

Marvin Humphrey

[1] We can change SI_winnow_anchors to take a slop param, then case the
    call to it with either 0 or non-zero.  In the 0 case, an optimizing
    compiler will have all the information it needs to build the exact-match
    version.  See <https://issues.apache.org/jira/browse/LUCY-99> for an
    example of this technique.


Re: ProximityQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/15/10 11:49 PM:

> Within the existing KS code base, PhraseScorer would be the closest thing to
> what you want.  It wasn't really built to handle nearness, but maybe it can be
> adapted.

My (perhaps naive) assumption was that a PhraseScorer isa ProximityScorer where
proximity==1.

> 
> Do you have an idea yet as to how you might publish this?

I need to understand how the phrase matching is done currently (see above). If I
could contribute it to the KS core, I'd be happy to. Otherwise, I imagine adding
it to Search::Query::Dialect::KSx as another *Query type, joining the Wildcard
features.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: ProximityQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Mar 15, 2010 at 10:57:28PM -0500, Peter Karman wrote:
> I'd like to offer a proximity query type in my app, so that I can search like:
> 
>  foo NEAR10 bar
> 
> to find all instances of 'foo' within 10 token positions of 'bar'.[0]
> 
> It seems like the place to start, if I were to take the route of
> subclassing/extending an existing class, is the PhraseQuery feature,
> specifically the PhraseScorer and the internal winnow_anchors() function. Am I
> on the right track here?

As you seem to have noted already, the hard part will be the Matcher class,
not the Query.

Within the existing KS code base, PhraseScorer would be the closest thing to
what you want.  It wasn't really built to handle nearness, but maybe it can be
adapted.

If you want to see other prior art, Lucene has SpanNearQuery
and SpanScorer:

http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/spans/SpanNearQuery.java

http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/spans/SpanScorer.java

Also, Lucene's PhraseScorer takes a "slop" parameter, which KinoSearch's does
not.  I forget exactly what it does and how it differs from
SpanNearQuery/SpanScorer.

http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/PhraseScorer.java

> [0] I believe Lucene syntax for that query is "foo bar"~10

Yes.  

http://lucene.apache.org/java/3_0_1/queryparsersyntax.html#Proximity%20Searches

That '10' is the 'slop' parameter.

Do you have an idea yet as to how you might publish this?

Marvin Humphrey