Posted to dev@lucy.apache.org by "David E. Wheeler" <da...@kineticode.com> on 2011/03/18 21:29:10 UTC

[lucy-dev] A Schema for PGXN

Howdy Lucites,

I'm starting work on an index for PGXN. Not heard of PGXN? Think of it as CPAN for PostgreSQL.

  http://pgxn.org/

Anyway, the things I want to index are:

* Distributions. Includes name, version, tags, abstract, description, user, and some other stuff.

* Extensions (modules in CPAN-speak). Mainly documentation in HTML.

* Tags. Contains a list of distributions associated with tags.

* Users. Includes name, email, URL, Twitter nick, and a list of distributions.

* Documentation. Random docs associated with a distribution but not a specific extension.

By default, a user will be able to search all these things at once. So I was thinking that I'd have just one schema/index, and use categories to separate the different objects. Given that, I was thinking of a schema with:

Title:     Name of a distribution/extension/tag/user
Abstract:  For distributions and extensions
Content:   Description and random docs for distributions,
           documentation body for extensions, distribution names
           for users and tags
Tags:      Tags associated with a distribution
Metadata:  Additional metadata: email addresses, URLs, dates,
           and other stuff associated with a distribution.

So for those fields that don't apply to a thing, like "tags" for a tag object, I'd just provide no value. Otherwise, I'd like to do a full-text search on all these fields.

So, does this seem like a reasonable search schema? I would appreciate any feedback and suggestions.

Thanks!

David


Re: [lucy-dev] A Schema for PGXN

Posted by "David E. Wheeler" <da...@kineticode.com>.
On Mar 18, 2011, at 2:36 PM, Marvin Humphrey wrote:

> Here's how I would express your schema in code:
> 
>    my $schema = Lucy::Plan::Schema->new;
>    my $polyanalyzer  = Lucy::Analysis::PolyAnalyzer->new(language => 'en');
>    my $fulltext_type = Lucy::Plan::FullTextType->new(
>        analyzer      => $polyanalyzer,
>        highlightable => 1,             # maybe
>    );
>    $schema->spec_field(name => 'Title',    type => $fulltext_type);
>    $schema->spec_field(name => 'Abstract', type => $fulltext_type);
>    $schema->spec_field(name => 'Content',  type => $fulltext_type);
>    my $pipe_toker = Lucy::Analysis::RegexTokenizer->new(pattern => '[^|]+'); 
>    my $pipe_type  = Lucy::Plan::FullTextType->new(analyzer => $pipe_toker);
>    $schema->spec_field(name => 'Tags',     type => $pipe_type);
>    $schema->spec_field(name => 'Metadata', type => $pipe_type);
> 
> I think that's the most straightforward way to start out.  From there, you can
> tweak and try other options as necessary.

Thanks. I'm using KS, though. It's the same interface, right?

>> So for those fields that don't apply to a thing, like "tags" for a tag
>> object, I'd just provide no value. Otherwise, I'd like to do a full-text
>> search on all these fields.
> 
> The default behavior of Lucy's QueryParser is to search all indexed fields.
> The weighting's going to get a little weird with the Tags and Metadata fields
> because of length normalization, but that's something to wrestle with later.

I don't understand what that means, sorry.

David



Re: [lucy-dev] A Schema for PGXN

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 18, 2011 at 01:29:10PM -0700, David E. Wheeler wrote:
> Title:     Name of a distribution/extension/tag/user
> Abstract:  For distributions and extensions
> Content:   Description and random docs for distributions,
>            documentation body for extensions, distribution names
>            for users and tags
> Tags:      Tags associated with a distribution
> Metadata:  Additional metadata: email addresses, URLs, dates,
>            and other stuff associated with a distribution.

Here's how I would express your schema in code:

    my $schema = Lucy::Plan::Schema->new;

    # Full-text fields: tokenized, case-folded, and stemmed for English.
    my $polyanalyzer  = Lucy::Analysis::PolyAnalyzer->new(language => 'en');
    my $fulltext_type = Lucy::Plan::FullTextType->new(
        analyzer      => $polyanalyzer,
        highlightable => 1,             # maybe
    );
    $schema->spec_field(name => 'Title',    type => $fulltext_type);
    $schema->spec_field(name => 'Abstract', type => $fulltext_type);
    $schema->spec_field(name => 'Content',  type => $fulltext_type);

    # Tags and Metadata: split on pipes only, so each value stays one token.
    my $pipe_toker = Lucy::Analysis::RegexTokenizer->new(pattern => '[^|]+');
    my $pipe_type  = Lucy::Plan::FullTextType->new(analyzer => $pipe_toker);
    $schema->spec_field(name => 'Tags',     type => $pipe_type);
    $schema->spec_field(name => 'Metadata', type => $pipe_type);

I think that's the most straightforward way to start out.  From there, you can
tweak and try other options as necessary.
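
The `[^|]+` pattern given to RegexTokenizer treats every run of non-pipe characters as a single token, so a multi-word tag survives intact. Here is a minimal sketch in plain Perl (no Lucy needed) of how that pattern carves up a pipe-joined value. The sample tag string is invented, and the snippet only mimics the pattern match, not the tokenizer's internals:

```perl
use strict;
use warnings;

# Hypothetical pipe-joined value, as it might be fed to the Tags field.
my $tags = 'full text search|key value|hstore';

# Same pattern as the RegexTokenizer: each match is a run of non-pipe characters.
my @tokens = $tags =~ /([^|]+)/g;

# One token per line; spaces inside a tag are preserved.
print "$_\n" for @tokens;
```

Since no case folding or stemming is applied to these fields, a search term presumably has to match a whole tag exactly, case included.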

> So for those fields that don't apply to a thing, like "tags" for a tag
> object, I'd just provide no value. Otherwise, I'd like to do a full-text
> search on all these fields.

The default behavior of Lucy's QueryParser is to search all indexed fields.
The weighting's going to get a little weird with the Tags and Metadata fields
because of length normalization, but that's something to wrestle with later.
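
For anyone unsure what length normalization means here: Lucene-style scoring scales a field's contribution by roughly 1/sqrt(number of tokens in the field), so a hit in a short field counts for more than the same hit in a long one. A rough illustration in plain Perl follows. The token counts are invented, and the formula is the textbook default, assumed rather than read out of any particular Lucy or KinoSearch build:

```perl
use strict;
use warnings;

# Assumed default: norm = 1 / sqrt(token count), and 0 for an empty field.
sub length_norm {
    my ($num_tokens) = @_;
    return $num_tokens ? 1 / sqrt($num_tokens) : 0;
}

# Invented token counts for the fields of one distribution.
my %field_tokens = (Tags => 4, Metadata => 9, Content => 2500);

# Print fields from shortest to longest with their normalization factor.
for my $field (sort { $field_tokens{$a} <=> $field_tokens{$b} } keys %field_tokens) {
    printf "%-8s %5d tokens  norm %.3f\n",
        $field, $field_tokens{$field}, length_norm($field_tokens{$field});
}
```

Under that assumption a four-token Tags field gets a factor of 0.5 while a 2,500-token Content field gets 0.02. Because the pipe tokenizer produces only a handful of tokens per document, matches in Tags and Metadata can outweigh matches in the body text.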

Marvin Humphrey