Posted to user@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2011/03/01 04:20:46 UTC

Re: [lucy-user] Question about query parsing API

Andrew S. Townley wrote on 2/26/11 6:55 AM:

> 
> At the very least, I need to be able to walk the query tree for any string
> input query (or any query object) with a consistent API.  What you have here
> is pretty similar to my own implementation of the Query Object pattern for my
> system, so that would be a start.  For any compound query term, I'm also
> "bubbling" up references to the property names as well as the query terms
> themselves.  This means that I can retrieve these easily and do some analysis
> on the query before actually executing it.
> 
> Anything that will support me doing the same type of thing with Lucy will
> work.
> 

It's not core to Lucy, but Search::Query::Dialect has a KSx (KinoSearch
extension) implementation that makes it easy to walk the tree:

http://search.cpan.org/~karman/Search-Query-0.18/lib/Search/Query/Dialect.pm#walk(_CODE_)

> 
>> There are portions of Lucy that have been intentionally left unimplemented
>> by the core.  The Perl implementation code is located in trunk/perl/xs/
>> and trunk/perl/lib/Lucy.pm.  This code will have to be ported for each new
>> host language regardless.
> 
> Interesting approach.  Is there some docs/rationale on which parts and why
> somewhere?  Sounds worth understanding in more detail.

Marvin can answer as to whether there are docs on this; my understanding of the
rationale is that since our goal is idiomatic language implementations on top of
the underlying C, each host language must do *some* work.

Lucy isn't a C library in the traditional sense; it's more like a
some-assembly-required C framework for writing an IR library in a dynamic
language. The C code handles the heavy-lifting bits that are too
resource-intensive to be practical in the host language. Then each host language
must glue it all together. Clownfish provides a way of generating (most of) that
glue.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Per-host abstract elements

Posted by "Andrew S. Townley" <as...@atownley.org>.
On 2 Mar 2011, at 1:59 AM, Peter Karman wrote:

> Marvin Humphrey wrote on 3/1/11 4:08 PM:
> 
>> Regarding *which* pieces of the core library are to be left unimplemented, a
>> systematic discussion has never taken place.
> 
> Thanks for starting one!
> 
> I have no doubt I'll be referring back to this email in future.

Agreed!  Thanks for posting all the detailed information, Marvin.  Really appreciate it, and I think it'll be useful to everyone.

Cheers!

ast
--
Andrew S. Townley <as...@atownley.org>
http://atownley.org


Re: [lucy-dev] Per-host abstract elements

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/1/11 4:08 PM:

> Regarding *which* pieces of the core library are to be left unimplemented, a
> systematic discussion has never taken place.

Thanks for starting one!

I have no doubt I'll be referring back to this email in future.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

[lucy-dev] Per-host abstract elements

Posted by Marvin Humphrey <ma...@rectangular.com>.
(moving to lucy-dev...)

On Mon, Feb 28, 2011 at 09:20:46PM -0600, Peter Karman wrote:
> > Interesting approach.  Is there some docs/rationale on which parts and why
> > somewhere?  Sounds worth understanding in more detail.
> 
> Marvin can answer as to whether there are docs on this; my understanding of the
> rationale is that since our goal is idiomatic language implementations on top of
> the underlying C, each host language must do *some* work.

There's this passage from the DevGuide:

  The C core is intentionally left incomplete, however; to be usable, it must
  be bound to a "host" language.  (In this context, even C is considered a
  "host" which must implement the missing pieces and be "bound" to the core.)
  Some of the binding code is autogenerated by Clownfish on a spec customized
  for each language.  Other pieces are hand-coded in either C (using the
  host's C API) or the host language itself.

There's also documentation within the main module page for the Clownfish
compiler at trunk/clownfish/lib/Clownfish.pm.  

Regarding *which* pieces of the core library are to be left unimplemented, a
systematic discussion has never taken place.  Some of the individual modules
have been discussed, e.g. Lucy::Document::Doc, and aspects of the object model
have been discussed on lucy-dev going back to the initial brainstorming Dave
Balmain and I did in 2006 -- but there hasn't been a high-level discussion of
what, taken as a whole, should be left abstract.

Historically, the codebase that is now Lucy began as a library that was mostly
Perl with some hand-coded XS (KinoSearch 0.1x).  There has been an ongoing
effort to port that codebase to C; for the core library, that task is mostly
done, while other components have reached various stages of completion:

    Clownfish compiler:  c. 50-60% done (under active development)
    Charmonizer:         done
    Test suite for core: c. 50% done

Because the porting effort is incomplete, though, the trunk/perl/ directory
contains more than it should.  It's not necessary to port everything in there
to create a binding to another language, and its contents should not be taken
as the end product of a coherent design effort.

The abstract chunks within the core library have various rationales.  The most
important is that as a community we care a great deal about user-friendly API
design: we started with a nice Perl API and we have been unwilling to
sacrifice its most important facets.  However, some abstract chunks remain
unimplemented just because implementing them is hard, impractical, or unwise.

The biggest unimplemented piece is the "fields" member in Lucy::Document::Doc,
which is left to be a native mapping type: Perl hash, Ruby Hash, Python dict,
etc.  The rationales are convenience and to a lesser extent minimizing string
copies.  Having "fields" left abstract necessitates custom code in the
following files:

    perl/xs/Lucy/Document/Doc.c
    perl/xs/Lucy/Index/DocReader.c
    perl/xs/Lucy/Index/Inverter.c

(I recall that the idea of using "overload" to get at the doc object's fields
originated with Father Chrysostomos.)

CaseFolder, Tokenizer and StringHelper are left incomplete because we want to
rely on the host language to supply a regex engine and complex Unicode
processing rather than write or bundle the code to do that ourselves.

    perl/xs/Lucy/Analysis/CaseFolder.c
    perl/xs/Lucy/Analysis/Tokenizer.c
    perl/xs/Lucy/Util/StringHelper.c

We rely on the host language for exception handling.  This has a big impact on
Lucy::Object::Err, but it also affects some other classes which have to catch
exceptions during normal operation.

    perl/xs/Lucy/Object/Err.c
    perl/xs/Lucy/Index/PolyReader.c
    perl/xs/Lucy/Index/SegReader.c

FSFolder is left incomplete because the "absolutify" function (which
transforms relative paths to absolute paths) hasn't been ported.

    perl/xs/Lucy/Store/FSFolder.c
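
For illustration only, here is a minimal sketch of what a C version of such
a function might look like.  It is not Lucy's code: the name and signature
are made up, it assumes a POSIX host (getcwd, "/" as separator), and it does
not resolve symlinks or collapse "." and ".." components.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not Lucy's API: return a newly malloc'd absolute
 * version of `path`, or NULL on failure.  The caller frees the result. */
char*
absolutify(const char *path)
{
    assert(path != NULL);

    if (path[0] == '/') {
        /* Already absolute: hand back a copy so the caller always owns
         * the returned string. */
        size_t len  = strlen(path);
        char  *copy = malloc(len + 1);
        if (copy) { memcpy(copy, path, len + 1); }
        return copy;
    }

    /* Relative: prepend the current working directory. */
    char cwd[4096];
    if (getcwd(cwd, sizeof(cwd)) == NULL) { return NULL; }

    /* Avoid a doubled slash when the working directory is "/". */
    size_t      cwd_len = strlen(cwd);
    const char *sep     = (cwd_len > 0 && cwd[cwd_len - 1] == '/') ? "" : "/";

    size_t size     = cwd_len + strlen(sep) + strlen(path) + 1;
    char  *abs_path = malloc(size);
    if (abs_path) { snprintf(abs_path, size, "%s%s%s", cwd, sep, path); }
    return abs_path;
}
```

A Windows host would need different logic (drive letters, backslashes),
which is part of why this kind of function ends up host-specific.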
    
Lucy::Util::Json is left incomplete because we haven't yet replaced our usage
of the CPAN module JSON::XS with either a bundled C library or a hand-rolled
parser based on the Lemon parser generator.

    perl/xs/Lucy/Util/Json.c

Lucy::Object::Obj caches a host object and, in Perl at least, uses it for
reference counting.  It's not clear exactly what we'll do in a
garbage-collected language like Ruby, but the design was discussed at
<http://markmail.org/message/jkst23okksyynzss>.

    perl/xs/Lucy/Object/Obj.c

Lucy::Object::VTable contains code which walks the host language's OO
hierarchy, and discovers when the user has supplied a method which should
override a core method.  When a dynamic VTable is created for a user-defined
subclass, a callback is automatically installed which invokes the overriding
subroutine.

    perl/xs/Lucy/Object/VTable.c
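
The override mechanism described above can be sketched roughly as follows.
This is a toy model under stated assumptions, not the real VTable code: the
struct layout, the single method slot, and every function name here are
invented, and the "walk the host's OO hierarchy" step is replaced by a
hard-coded check for an imaginary user class called "MyDoc".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Obj Obj;
typedef const char* (*to_string_t)(Obj *self);

/* Hypothetical, heavily simplified vtable with a single method slot. */
typedef struct VTable {
    const char    *name;
    struct VTable *parent;
    to_string_t    to_string;
} VTable;

struct Obj { VTable *vtable; };

/* Core implementation of the method. */
const char*
core_to_string(Obj *self)
{
    (void)self;
    return "core";
}

/* Stand-in for the callback that would invoke the user's host-language
 * subroutine. */
const char*
callback_to_string(Obj *self)
{
    (void)self;
    return "host override";
}

/* Stand-in for walking the host's OO hierarchy.  Here only the invented
 * class "MyDoc" overrides to_string. */
int
host_class_overrides(const char *class_name, const char *method)
{
    return strcmp(class_name, "MyDoc") == 0
           && strcmp(method, "to_string") == 0;
}

/* Build a vtable for a user-defined subclass: inherit every slot from the
 * parent, then install the callback wherever the host overrides a method. */
VTable
make_subclass_vtable(VTable *parent, const char *class_name)
{
    assert(parent != NULL && class_name != NULL);
    VTable vt = *parent;
    vt.name   = class_name;
    vt.parent = parent;
    if (host_class_overrides(class_name, "to_string")) {
        vt.to_string = callback_to_string;
    }
    return vt;
}
```

Core code that calls through the slot never needs to know whether it
reached the core implementation or the host callback.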

Lucy::Object::Host implements the mechanism by which core code calls back into
the host language.
    
    perl/xs/Lucy/Object/Host.c

Lucy::Object::LockFreeRegistry is an oddball class, used only for one purpose:
thread-safe access to VTable singletons.  There are a few lines of esoteric
code needed in its Perl binding due to the fact that it must be accessible
from multiple threads.

    perl/xs/Lucy/Object/LockFreeRegistry.c

The Perl module perl/lib/Lucy.pm now houses pure Perl code which was
previously spread across multiple files.  It contains some of the actual
implementation code which the C files call back to.  For instance,
perl/xs/Lucy/Util/Json.c contains glue code which invokes callbacks to Perl
subroutines defined in Lucy.pm:

... interface definition in core/Lucy/Util/Json.cfh...

    /** Encode <code>dump</code> as JSON.
     */
    inert incremented CharBuf* 
    to_json(Obj *dump);

... glue code in perl/xs/Lucy/Util/Json.c...

    CharBuf*
    Json_to_json(Obj *dump)
    {
        return Host_callback_str(JSON, "to_json", 1,
            ARG_OBJ("dump", dump));
    }

... and implementation code in perl/lib/Lucy.pm:

    sub to_json {
        my ( undef, $dump ) = @_;
        return $json_encoder->encode($dump);
    }
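
To make the three-layer pattern above concrete without a Perl interpreter,
here is a self-contained C sketch of the same shape: a core-facing function
that delegates by name to whatever the "host" has registered.  All names
and signatures are hypothetical and much simpler than the real
Host_callback_str; the stand-in encoder just wraps its argument in quotes
and does no JSON escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical signature for a host-side routine: write a string result
 * into `out`, return 0 on success or -1 on failure. */
typedef int (*host_fn_t)(const char *arg, char *out, size_t out_size);

typedef struct {
    const char *name;
    host_fn_t   fn;
} HostEntry;

/* Stand-in for the host implementation (the Perl sub in the real code).
 * It only wraps the argument in double quotes; no escaping is done. */
int
demo_to_json(const char *arg, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "\"%s\"", arg);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* The routines this toy "host" has made available to the core. */
static const HostEntry host_table[] = {
    { "to_json", demo_to_json },
};

/* Look up `method` by name and invoke it; -1 if the host lacks it. */
int
host_callback_str(const char *method, const char *arg,
                  char *out, size_t out_size)
{
    assert(method != NULL && arg != NULL && out != NULL);
    size_t count = sizeof(host_table) / sizeof(host_table[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(host_table[i].name, method) == 0) {
            return host_table[i].fn(arg, out, out_size);
        }
    }
    return -1;
}

/* Glue: the core-facing function has no logic of its own and simply
 * delegates to the host. */
int
json_to_json(const char *dump, char *out, size_t out_size)
{
    return host_callback_str("to_json", dump, out, out_size);
}
```

Once a pure C encoder exists, only the body of the glue function has to
change; callers of the core-facing function are unaffected.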

Lastly, there is code which performs conversions between Lucy data structures
and host data structures and which performs parameter validation and argument
handling.

    perl/xs/XSBind.h
    perl/xs/XSBind.c

At some point, the contents of those XSBind modules will likely move
underneath clownfish/.  Other code, e.g. the Json materials, will simply
vanish as we create pure C implementations in core/.

Marvin Humphrey


Re: [lucy-user] Question about query parsing API

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Feb 28, 2011 at 09:20:46PM -0600, Peter Karman wrote:
> > Interesting approach.  Is there some docs/rationale on which parts and why
> > somewhere?  Sounds worth understanding in more detail.
> 
> Marvin can answer as to whether there are docs on this; my understanding of the
> rationale is that since our goal is idiomatic language implementations on top of
> the underlying C, each host language must do *some* work.

I've replied to this on lucy-dev, under the subject heading "Per-host abstract
elements".

Marvin Humphrey