Posted to dev@lucy.apache.org by "David E. Wheeler" <da...@kineticode.com> on 2011/03/18 21:29:10 UTC
[lucy-dev] A Schema for PGXN
Howdy Lucites,
I'm starting work on an index for PGXN. Not heard of PGXN? Think of it as CPAN for PostgreSQL.
http://pgxn.org/
Anyway, the things I want to index are:
* Distributions. Includes name, version, tags, abstract, description, user, and some other stuff.
* Extensions (modules in CPAN-speak). Mainly documentation in HTML.
* Tags. Contains a list of distributions associated with tags.
* Users. Includes name, email, URL, Twitter nick, and a list of distributions.
* Documentation. Random docs associated with a distribution but not a specific extension.
By default, a user will be able to search all these things at once. So I was thinking that I'd have just one schema/index, and use categories to separate the different objects. Given that, I was thinking of a schema with:
    Title:    Name of a distribution/extension/tag/user
    Abstract: For distributions and extensions
    Content:  Description and random docs for distributions,
              documentation body for extensions, distribution names
              for users and tags
    Tags:     Tags associated with a distribution
    Metadata: Additional metadata: email addresses, URLs, dates,
              and other stuff associated with a distribution.
So for those fields that don't apply to a thing, like "tags" for a tag object, I'd just provide no value. Otherwise, I'd like to do a full-text search on all these fields.
So, does this seem like a reasonable search schema? I would appreciate any feedback and suggestions.
Thanks!
David
Re: [lucy-dev] A Schema for PGXN
Posted by "David E. Wheeler" <da...@kineticode.com>.
On Mar 18, 2011, at 2:36 PM, Marvin Humphrey wrote:
> Here's how I would express your schema in code:
>
> my $schema = Lucy::Plan::Schema->new;
> my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(language => 'en');
> my $fulltext_type = Lucy::Plan::FullTextType->new(
>     analyzer      => $polyanalyzer,
>     highlightable => 1,    # maybe
> );
> $schema->spec_field(name => 'Title', type => $fulltext_type);
> $schema->spec_field(name => 'Abstract', type => $fulltext_type);
> $schema->spec_field(name => 'Content', type => $fulltext_type);
> my $pipe_toker = Lucy::Analysis::RegexTokenizer->new(pattern => '[^|]+');
> my $pipe_type = Lucy::Plan::FullTextType->new(analyzer => $pipe_toker);
> $schema->spec_field(name => 'Tags', type => $pipe_type);
> $schema->spec_field(name => 'Metadata', type => $pipe_type);
>
> I think that's the most straightforward way to start out. From there, you can
> tweak and try other options as necessary.
Thanks. I'm using KS, though. It's the same interface, right?
>> So for those fields that don't apply to a thing, like "tags" for a tag
>> object, I'd just provide no value. Otherwise, I'd like to do a full-text
>> search on all these fields.
>
> The default behavior of Lucy's QueryParser is to search all indexed fields.
> The weighting's going to get a little weird with the Tags and Metadata fields
> because of length normalization, but that's something to wrestle with later.
I don't understand what that means, sorry.
David
Re: [lucy-dev] A Schema for PGXN
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 18, 2011 at 01:29:10PM -0700, David E. Wheeler wrote:
>     Title:    Name of a distribution/extension/tag/user
>     Abstract: For distributions and extensions
>     Content:  Description and random docs for distributions,
>               documentation body for extensions, distribution names
>               for users and tags
>     Tags:     Tags associated with a distribution
>     Metadata: Additional metadata: email addresses, URLs, dates,
>               and other stuff associated with a distribution.
Here's how I would express your schema in code:
    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Analysis::RegexTokenizer;

    my $schema = Lucy::Plan::Schema->new;

    # Full-text fields, analyzed with the English PolyAnalyzer.
    my $polyanalyzer  = Lucy::Analysis::PolyAnalyzer->new(language => 'en');
    my $fulltext_type = Lucy::Plan::FullTextType->new(
        analyzer      => $polyanalyzer,
        highlightable => 1,    # maybe
    );
    $schema->spec_field(name => 'Title',    type => $fulltext_type);
    $schema->spec_field(name => 'Abstract', type => $fulltext_type);
    $schema->spec_field(name => 'Content',  type => $fulltext_type);

    # Pipe-delimited fields: each run of non-pipe characters is one token.
    my $pipe_toker = Lucy::Analysis::RegexTokenizer->new(pattern => '[^|]+');
    my $pipe_type  = Lucy::Plan::FullTextType->new(analyzer => $pipe_toker);
    $schema->spec_field(name => 'Tags',     type => $pipe_type);
    $schema->spec_field(name => 'Metadata', type => $pipe_type);
I think that's the most straightforward way to start out. From there, you can
tweak and try other options as necessary.
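
To make the pipe tokenizer concrete, here is a minimal sketch in plain Perl (no Lucy required) of how the pattern '[^|]+' splits a value. It assumes RegexTokenizer treats each match of the pattern as one token; the sample tag string is hypothetical and not taken from PGXN.

```perl
use strict;
use warnings;

# Hypothetical pipe-delimited value, as it might be supplied for the Tags field.
my $tags = 'full text search|semver|key value';

# Same pattern as the one passed to RegexTokenizer: every maximal run of
# non-pipe characters is one match, so a tag containing spaces stays whole.
my @tokens = $tags =~ /[^|]+/g;

print "$_\n" for @tokens;
```

Since no other analyzer is chained after the tokenizer in this schema, the Tags and Metadata fields would presumably match only on whole, unmodified tokens (no case folding or stemming), which is worth keeping in mind when deciding how to populate them.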
> So for those fields that don't apply to a thing, like "tags" for a tag
> object, I'd just provide no value. Otherwise, I'd like to do a full-text
> search on all these fields.
The default behavior of Lucy's QueryParser is to search all indexed fields.
The weighting's going to get a little weird with the Tags and Metadata fields
because of length normalization, but that's something to wrestle with later.
Marvin Humphrey
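
For readers wondering what the length normalization remark means in practice, the following is a rough sketch rather than Lucy's actual code. It assumes the commonly used Lucene-style default, where a field's score is scaled by 1/sqrt(number of tokens in the field). The length_norm helper and the token counts are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Assumed default: norm = 1 / sqrt(number of tokens in the field).
# Empty fields get a norm of 0 here to avoid dividing by zero.
sub length_norm {
    my ($num_tokens) = @_;
    return $num_tokens > 0 ? 1 / sqrt($num_tokens) : 0;
}

# Hypothetical token counts for the fields of one distribution.
my %tokens_in = (Tags => 4, Abstract => 16, Content => 400);

# Print fields from shortest to longest with their normalization factor.
for my $field (sort { $tokens_in{$a} <=> $tokens_in{$b} } keys %tokens_in) {
    printf "%-8s %4d tokens  norm %.3f\n",
        $field, $tokens_in{$field}, length_norm($tokens_in{$field});
}
```

Under that assumption, a hit in a four-token Tags field is scaled by 0.5 while a hit in a 400-token Content field is scaled by 0.05, so matches in the short pipe-delimited fields can outweigh matches in the document body. That is the kind of skew the weighting comment refers to.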