Posted to user@lucy.apache.org by Aleksandar Radovanovic <Al...@Radovanovic.com> on 2012/12/29 16:22:01 UTC

[lucy-user] New feature suggestion

Hi there,

I was wondering: would it be possible to add a new feature to the
indexing engine (or somehow simulate one) that does EXACTLY the opposite
of Lucy::Analysis::SnowballStopFilter? In other words, instead of
blocking a list of stopwords, the indexing engine would index ONLY the
phrases supplied in a user list, as exact matches. Or even better, it
could prioritize them for indexing: index the user list first, and then
use the Lucy analyzer for words that are not in the list.

Why can this be useful? In chemistry, for example, it is simply
impossible to create a rule that will index chemical names correctly
(e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), to name just a few of
thousands). Also, in biomedical text, a seemingly common word can
represent, for example, a gene or protein name that should not be
stemmed. To summarize, this feature would allow one to create a correct
index of specialized terms.

Alex
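The requested "opposite of stopwords" behavior can be sketched in a few
lines of plain Perl; the phrase list and sample sentence here are purely
illustrative:

```perl
use strict;
use warnings;

# Hypothetical "keep list": only these exact phrases should be indexed.
my @keep_list = ( 'NH4+/H+K+/NH4+(H+)', '[Hg(CN)2]', 'Ca(.-)' );

# Return the listed phrases that occur verbatim in the text.
# index() does a literal substring search, so no metacharacter
# in a chemical name needs escaping.
sub goword_hits {
    my ($text) = @_;
    return [ grep { index( $text, $_ ) >= 0 } @keep_list ];
}

my $hits = goword_hits('Samples of [Hg(CN)2] and Ca(.-) were prepared.');
# $hits is ['[Hg(CN)2]', 'Ca(.-)']
```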

Re: [lucy-user] New feature suggestion

Posted by Aleksandar Radovanovic <Al...@Radovanovic.com>.
On 12/30/12 7:12 PM, Peter Karman wrote:
> Aleksandar Radovanovic wrote on 12/30/12 5:21 AM:
>
>> Thank you, Marvin, I tried what you suggested! It works fine, but my
>> main problem still remains: how to find and index *predefined* phrases.
>> In your example this boils down to the implementation of
>> extract_chem_names($content).
>>
>> I was hoping to use some Lucy functionality for this: index the whole
>> text, search the index for the predefined phrases, and index those
>> separately. But this does not work correctly for biomedical documents,
>> in which the text often looks like a random sequence of strange
>> characters and non-language words that Lucy simply skips, or stems
>> incorrectly.
>>
>> So, the core of my idea is to have something opposite to stopwords: a
>> list of phrases that will be indexed without the stemmer, exactly as
>> they appear in the user-supplied list. I was wondering why such a
>> simple and obvious feature has not been implemented - or am I missing
>> something?
>>
> You're missing something. Stopword filtering happens *after* tokenizing in the
> analysis chain; so too would your Goword filter. It's the tokenizing that's
> problematic.
>
> The problem isn't the lack of a GoWordFilter, it's the lack of a ChemTokenizer:
> how to tokenize a block of text that contains *both* chemical strings and
> narrative strings. It's like trying to apply an English stemmer to a text that
> contains both English and French. The problem is: how to apply the rules for one
> grammar against a text that contains mixed grammars that use the same alphabet.
> Writing a single regex is practically impossible.
>
> If you just wanted to pull out the chemical strings from your text, and ignore
> everything else, that would be a fairly straightforward task. If you wanted to
> ignore all the chemical strings, that too would be straightforward (that's what
> basically happens by default). But you seem to want to combine them. That's not
> simple or straightforward.
>
> Marvin's suggestion tries to address the complexity you're after. If what you're
> missing is an implementation of extract_chem_names(), that seems like a suitable
> exercise for you to undertake, since that requires domain-specific knowledge. I
> might start with something naive like:
>
>  my @chem_names = (
>      'NH4+/H+K+/NH4+(H+)',
>      '[Hg(CN)2]',
>      'Ca(.-)',
>  );
>
>  sub extract_chem_names {
>     my $text = shift;
>     my @matches;
>     for my $n (@chem_names) {
>        my $esc = quotemeta($n);
>        if ($text =~ m/$esc/) {
>            push @matches, $n;
>        }
>     }
>     return \@matches;
>  }
>
>
>

I see it clearly now. To express it in Lucy syntax, I would need some
expanded PolyAnalyzer:

my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new (
    dictionaries => [ $chemicals, $genes, $human_anatomy ],
    language => 'en',
);
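Although no such `dictionaries` parameter exists, its effect can be
approximated by giving each dictionary its own exact-match field, using
the RegexTokenizer pattern Marvin suggested. A sketch (the field names
and the three-dictionary split are illustrative):

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::RegexTokenizer;

my $schema = Lucy::Plan::Schema->new;

# Verbatim tokens: split only on the ASCII unit separator (\x1F),
# so listed phrases survive unstemmed and unsplit.
my $verbatim   = Lucy::Analysis::RegexTokenizer->new( pattern => '[^\\x1F]+' );
my $exact_type = Lucy::Plan::FullTextType->new( analyzer => $verbatim );

# One exact-match field per specialist dictionary.
$schema->spec_field( name => $_, type => $exact_type )
    for qw( chemicals genes human_anatomy );
```

At add_doc() time each field would receive the \x1F-joined hits from its
own extractor, and at search time each would get its own QueryParser, as
in Marvin's example.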

Since such magic does not (yet :-) exist, I'll follow your advice.
Marvin, Peter, thank you so much for all your help!

Regards, Alex


Re: [lucy-user] New feature suggestion

Posted by Peter Karman <pe...@peknet.com>.
Aleksandar Radovanovic wrote on 12/30/12 5:21 AM:

> Thank you, Marvin, I tried what you suggested! It works fine, but my
> main problem still remains: how to find and index *predefined* phrases.
> In your example this boils down to the implementation of
> extract_chem_names($content).
>
> I was hoping to use some Lucy functionality for this: index the whole
> text, search the index for the predefined phrases, and index those
> separately. But this does not work correctly for biomedical documents,
> in which the text often looks like a random sequence of strange
> characters and non-language words that Lucy simply skips, or stems
> incorrectly.
>
> So, the core of my idea is to have something opposite to stopwords: a
> list of phrases that will be indexed without the stemmer, exactly as
> they appear in the user-supplied list. I was wondering why such a
> simple and obvious feature has not been implemented - or am I missing
> something?
> 

You're missing something. Stopword filtering happens *after* tokenizing in the
analysis chain; so too would your Goword filter. It's the tokenizing that's
problematic.

The problem isn't the lack of a GoWordFilter, it's the lack of a ChemTokenizer:
how to tokenize a block of text that contains *both* chemical strings and
narrative strings. It's like trying to apply an English stemmer to a text that
contains both English and French. The problem is: how to apply the rules for one
grammar against a text that contains mixed grammars that use the same alphabet.
Writing a single regex is practically impossible.
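The tokenizing problem is easy to demonstrate in a few lines of plain
Perl: a naive word-character tokenizer (the sample sentence is
illustrative) shreds a chemical name before any stopword-style filter
could ever see it:

```perl
use strict;
use warnings;

my $text = 'Dissolve [Hg(CN)2] in water.';

# A naive "words only" tokenizer, roughly what a generic analyzer does:
my @tokens = $text =~ /(\w+)/g;
# @tokens is ('Dissolve', 'Hg', 'CN', '2', 'in', 'water') -- the name
# [Hg(CN)2] has already been split into meaningless fragments, so a
# downstream "GoWord" filter would never see it as a single token.
```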

If you just wanted to pull out the chemical strings from your text, and ignore
everything else, that would be a fairly straightforward task. If you wanted to
ignore all the chemical strings, that too would be straightforward (that's what
basically happens by default). But you seem to want to combine them. That's not
simple or straightforward.

Marvin's suggestion tries to address the complexity you're after. If what you're
missing is an implementation of extract_chem_names(), that seems like a suitable
exercise for you to undertake, since that requires domain-specific knowledge. I
might start with something naive like:

 my @chem_names = (
     'NH4+/H+K+/NH4+(H+)',
     '[Hg(CN)2]',
     'Ca(.-)',
 );

 sub extract_chem_names {
    my $text = shift;
    my @matches;
    for my $n (@chem_names) {
       my $esc = quotemeta($n);
       if ($text =~ m/$esc/) {
           push @matches, $n;
       }
    }
    return \@matches;
 }
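As a side note, with thousands of names the one-regex-per-name loop
above gets slow; one common refinement (not from the thread, just a
sketch with the same illustrative list) is to compile the whole list
into a single longest-first alternation:

```perl
use strict;
use warnings;

my @chem_names = ( 'NH4+/H+K+/NH4+(H+)', '[Hg(CN)2]', 'Ca(.-)' );

# Longest names first, so a long name wins over any name that is a
# substring of it; quotemeta escapes (, ), +, [, ] and friends.
# qr// compiles the alternation once instead of once per document.
my $chem_re = do {
    my $alt = join '|',
        map { quotemeta } sort { length($b) <=> length($a) } @chem_names;
    qr/($alt)/;
};

# Return each listed name found in the text, once, in order of appearance.
sub extract_chem_names {
    my ($text) = @_;
    my %seen;
    return [ grep { !$seen{$_}++ } $text =~ /$chem_re/g ];
}

my $found = extract_chem_names('Ca(.-) and [Hg(CN)2] and Ca(.-) again');
# $found is ['Ca(.-)', '[Hg(CN)2]']
```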



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] New feature suggestion

Posted by Aleksandar Radovanovic <Al...@Radovanovic.com>.
On 12/30/12 4:22 AM, Marvin Humphrey wrote:
> On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
> <Al...@radovanovic.com> wrote:
>> I was wondering: would it be possible to add a new feature to the
>> indexing engine (or somehow simulate one) that does EXACTLY the opposite
>> of Lucy::Analysis::SnowballStopFilter? In other words, instead of
>> blocking a list of stopwords, the indexing engine would index ONLY the
>> phrases supplied in a user list, as exact matches. Or even better, it
>> could prioritize them for indexing: index the user list first, and then
>> use the Lucy analyzer for words that are not in the list.
>>
>> Why can this be useful? In chemistry, for example, it is simply
>> impossible to create a rule that will index chemical names correctly
>> (e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), to name just a few of
>> thousands). Also, in biomedical text, a seemingly common word can
>> represent, for example, a gene or protein name that should not be
>> stemmed. To summarize, this feature would allow one to create a correct
>> index of specialized terms.
> I think you could achieve this now by extracting the list of terms yourself
> prior to indexing and using a custom RegexTokenizer.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
>     my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
>     $schema->spec_field(name => 'chemicals', type => $type);
>
>     ...
>
>     my @chemical_names = extract_chem_names($content);
>     my $chem_content = join(' ', @chemical_names);
>     $indexer->add_doc({
>         content   => $content,
>         chemicals => $chem_content,
>         ...
>     });
>
> If the chemical names may contain whitespace, I'd suggest using "\x1F", the
> ASCII "unit separator", as a delimiter.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
>         pattern => '[^\\x1F]+'
>     );
>
>     ...
>
>     my $chem_content = join("\x1F", @chemical_names);
>
> At search-time, you'd need to duplicate the transform and feed the content to
> an extra QueryParser.
>
>     my $main_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>     );
>     my $chem_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>         fields => ['chemicals'],
>     );
>     my $main_query = $main_parser->parse($query_string);
>     my $chem_query
>         = $chem_parser->parse(join(' ', extract_chem_names($query_string)));
>     my $or_query = Lucy::Search::ORQuery->new(
>         children => [$main_query, $chem_query],
>     );
>     my $hits = $searcher->hits(query => $or_query);
>     ...
>
> The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give you
> some ideas as well.
>
> Cheers,
>
> Marvin Humphrey
>
>
Thank you, Marvin, I tried what you suggested! It works fine, but my
main problem still remains: how to find and index *predefined* phrases.
In your example this boils down to the implementation of
extract_chem_names($content).

I was hoping to use some Lucy functionality for this: index the whole
text, search the index for the predefined phrases, and index those
separately. But this does not work correctly for biomedical documents,
in which the text often looks like a random sequence of strange
characters and non-language words that Lucy simply skips, or stems
incorrectly.

So, the core of my idea is to have something opposite to stopwords: a
list of phrases that will be indexed without the stemmer, exactly as
they appear in the user-supplied list. I was wondering why such a simple
and obvious feature has not been implemented - or am I missing something?

Alex

Re: [lucy-user] New feature suggestion

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
<Al...@radovanovic.com> wrote:
> I was wondering: would it be possible to add a new feature to the
> indexing engine (or somehow simulate one) that does EXACTLY the opposite
> of Lucy::Analysis::SnowballStopFilter? In other words, instead of
> blocking a list of stopwords, the indexing engine would index ONLY the
> phrases supplied in a user list, as exact matches. Or even better, it
> could prioritize them for indexing: index the user list first, and then
> use the Lucy analyzer for words that are not in the list.
>
> Why can this be useful? In chemistry, for example, it is simply
> impossible to create a rule that will index chemical names correctly
> (e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), to name just a few of
> thousands). Also, in biomedical text, a seemingly common word can
> represent, for example, a gene or protein name that should not be
> stemmed. To summarize, this feature would allow one to create a correct
> index of specialized terms.

I think you could achieve this now by extracting the list of terms yourself
prior to indexing and using a custom RegexTokenizer.

    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
    my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
    $schema->spec_field(name => 'chemicals', type => $type);

    ...

    my @chemical_names = extract_chem_names($content);
    my $chem_content = join(' ', @chemical_names);
    $indexer->add_doc({
        content   => $content,
        chemicals => $chem_content,
        ...
    });

If the chemical names may contain whitespace, I'd suggest using "\x1F", the
ASCII "unit separator", as a delimiter.

    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '[^\\x1F]+'
    );

    ...

    my $chem_content = join("\x1F", @chemical_names);

At search-time, you'd need to duplicate the transform and feed the content to
an extra QueryParser.

    my $main_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
    );
    my $chem_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => ['chemicals'],
    );
    my $main_query = $main_parser->parse($query_string);
    my $chem_query
        = $chem_parser->parse(join(' ', extract_chem_names($query_string)));
    my $or_query = Lucy::Search::ORQuery->new(
        children => [$main_query, $chem_query],
    );
    my $hits = $searcher->hits(query => $or_query);
    ...

The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give you
some ideas as well.

Cheers,

Marvin Humphrey