Posted to dev@lucy.apache.org by Moritz Lenz <mo...@faui2k3.org> on 2013/07/15 18:23:29 UTC

[lucy-dev] Thank you, and site built with Lucy

Hello,

this might be slightly off-topic here, but still I'd like to thank all 
contributors for their amazing work on Lucy.

Recently I've built a search feature for my web IRC logs with Lucy, and 
I found the (Perl-)API nice to use, and the error messages very helpful 
too. (Except that one time when I managed to make lucy + perl 5.10.1 
segfault; sadly I didn't manage to reproduce it).

So, please keep up the good work!

Some details:

An example search can be found here: 
http://irclog.perlgeek.de/perl6/search/?nick=timtoady&q=threads
The backend code is here: 
https://github.com/moritz/ilbot/blob/master/lib/Ilbot/Backend/Search.pm

For the indexing I lump together all subsequent lines by the same nick 
into one document, and store the database IDs as a comma-separated value 
in a second field, and the day in a third field. Each IRC channel has a 
separate index.
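
A minimal sketch of that grouping step, in plain Perl without any Lucy 
calls. The row layout and the field names (`nick`, `line`, `ids`, `day`) 
are assumptions made for illustration and are not taken from the actual 
ilbot code:

```perl
use strict;
use warnings;

# Hypothetical input: a day string plus rows of [ database id, nick, text ]
# in chronological order. Consecutive rows by the same nick are merged
# into one document hash, ready to be handed to an indexer.
sub group_by_nick {
    my ($day, @rows) = @_;
    my @docs;
    for my $row (@rows) {
        my ($id, $nick, $text) = @$row;
        if (@docs && $docs[-1]{nick} eq $nick) {
            # Same speaker as the previous line: extend the current document.
            $docs[-1]{line} .= "\n$text";
            $docs[-1]{ids}  .= ",$id";
        }
        else {
            # New speaker: start a new document.
            push @docs, { nick => $nick, line => $text, ids => "$id", day => $day };
        }
    }
    return @docs;
}

my @docs = group_by_nick(
    '2013-07-15',
    [ 1, 'alice', 'hello' ],
    [ 2, 'alice', 'anyone here?' ],
    [ 3, 'bob',   'yes' ],
);
printf "%s [%s]\n", $_->{nick}, $_->{ids} for @docs;
```

Each resulting hash would then correspond to one document in the 
per-channel index.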

Then when displaying the search results, I retrieve all lines for that 
day and channel from the database or cache (which is fast enough, and 
much simpler than building complicated queries), and filter out the 
search results, plus a few lines before and afterward for context.
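
Roughly, the filtering step could look like the following sketch. It is 
plain Perl; the function name, the ID values and the context size are 
made up for illustration, and the real code may well differ:

```perl
use strict;
use warnings;

# Return the sorted array indexes of the lines to display: every hit plus
# $context lines before and after it, clipped to the bounds of the day.
sub lines_with_context {
    my ($day_ids, $hit_ids, $context) = @_;
    my %is_hit = map { $_ => 1 } @$hit_ids;
    my %show;
    for my $i (0 .. $#$day_ids) {
        next unless $is_hit{ $day_ids->[$i] };
        my $from = $i - $context;
        $from = 0 if $from < 0;
        my $to = $i + $context;
        $to = $#$day_ids if $to > $#$day_ids;
        $show{$_} = 1 for $from .. $to;
    }
    return sort { $a <=> $b } keys %show;
}

# Hypothetical data: all database IDs for one day and channel, and the
# comma-separated ID field of one search hit.
my @day_ids = (101 .. 110);
my @hit_ids = split /,/, '104,105';
my @shown   = lines_with_context(\@day_ids, \@hit_ids, 2);
print join(',', @day_ids[@shown]), "\n";
```

Overlapping context windows collapse automatically because the indexes 
are collected in a hash before sorting.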

So far I'm very happy with these tradeoffs, and like the results.

Cheers,
Moritz

Re: [lucy-dev] Thank you, and site built with Lucy

Posted by Nick Wellnhofer <we...@aevum.de>.
On 16/07/2013 12:44, Moritz Lenz wrote:
> For my purposes it would be much nicer to obtain indexes into the string
> where the search term was found, (or alternatively, let me set separate
> callbacks for both the context and the search results) so that I could
> do my own processing with that information.

This should be possible via the `highlight_spans` method of 
`Lucy::Search::Compiler`. Something like the following should work, at 
least that's what the highlighter does internally:

     my $compiler = $query->make_compiler(searcher => $searcher);
     my $doc_vec  = $searcher->fetch_doc_vec($hit->get_doc_id);
     my $spans    = $compiler->highlight_spans(
         searcher => $searcher,
         doc_vec  => $doc_vec,
         field    => $field,
     );
     for my $span (@$spans) {
         my $offset = $span->get_offset;
         my $length = $span->get_length;
         my $weight = $span->get_weight;
     }

`offset` and `length` are in Unicode code points. `fetch_doc_vec` is 
undocumented, unfortunately.
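
With offsets and lengths in hand, the custom processing could be applied 
per segment. The following is only a sketch in plain Perl: the spans are 
hard-coded `[offset, length]` pairs standing in for the values returned 
by `get_offset` and `get_length`, and `render_with_spans` and 
`escape_html` are hypothetical helpers, not part of Lucy. It assumes 
`$text` is a decoded character string, so that `substr` counts code 
points:

```perl
use strict;
use warnings;

# Minimal HTML escaping, used here as a stand-in for custom processing.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Walk the text span by span, passing unmatched stretches to $plain_cb
# and matched stretches to $match_cb.
sub render_with_spans {
    my ($text, $spans, $plain_cb, $match_cb) = @_;
    my $out = '';
    my $pos = 0;
    for my $span (sort { $a->[0] <=> $b->[0] } @$spans) {
        my ($offset, $length) = @$span;
        next if $offset < $pos;    # skip spans overlapping an earlier one
        $out .= $plain_cb->(substr $text, $pos, $offset - $pos);
        $out .= $match_cb->(substr $text, $offset, $length);
        $pos = $offset + $length;
    }
    $out .= $plain_cb->(substr $text, $pos);
    return $out;
}

my $text = 'use <threads> or threads::shared';
my $html = render_with_spans(
    $text,
    [ [ 5, 7 ], [ 17, 7 ] ],
    \&escape_html,
    sub { '<b>' . escape_html($_[0]) . '</b>' },
);
print "$html\n";
```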

Nick


Re: [lucy-dev] Thank you, and site built with Lucy

Posted by Moritz Lenz <mo...@faui2k3.org>.

On 07/16/2013 01:43 AM, Marvin Humphrey wrote:
> On Mon, Jul 15, 2013 at 9:23 AM, Moritz Lenz <mo...@faui2k3.org> wrote:
>> Some details:
>>
>> An example search can be found here:
>> http://irclog.perlgeek.de/perl6/search/?nick=timtoady&q=threads
>> The backend code is here:
>> https://github.com/moritz/ilbot/blob/master/lib/Ilbot/Backend/Search.pm
>>
>> For the indexing I lump together all subsequent lines by the same nick into
>> one document, and store the database IDs as a comma-separated value in a
>> second field, and the day in a third field. Each IRC channel has a separate
>> index.
>>
>> Then when displaying the search results, I retrieve all lines for that day
>> and channel from the database or cache (which is fast enough, and much
>> simpler than building complicated queries), and filter out the search
>> results, plus a few lines before and afterward for context.
>>
>> So far I'm very happy with these tradeoffs, and like the results.
>
> It's a very nice interface.  Congratulations on a successful design. :)

Thanks. (TBH the design wasn't by me. I provided a useful but ugly 
service, and a user was sufficiently annoyed to provide a better design; 
this approach has worked several times for me in the open source 
community :-)

> I wonder whether you might consider making the "line" field stored and
> highlightable.
>
>        my $type = Lucy::Plan::FullTextType->new(
>            analyzer => $polyanalyzer,
> -         stored => 0,
> +         highlightable => 1,
>        );
>
> I see that you've emboldened the relevant line, but you could go further and
> use the Highlighter to emphasize the keywords that were searched for.
>
>      http://lucy.apache.org/docs/perl/Lucy/Docs/Tutorial/Highlighter.html
>      http://lucy.apache.org/docs/perl/Lucy/Highlight/Highlighter.html
>
> By default, the Highlighter surrounds keywords with `<strong>` tags, but using
> set_pre_tag() and set_post_tag() you can make it use a span with CSS, <blink>
> tags, or whatever.

Thanks for the comment.
I'm well aware of the highlighting feature. The reason I don't use it is 
that (although it's not obvious from the example search I've linked to) 
there is a lot of processing going on (escaping HTML, automatically 
turning URLs into links, inserting zero-width breaking spaces into long 
words to prevent horizontal scrolling, ...), and I couldn't quite figure 
out how to mix my own processing with the highlighting from Lucy.

For my purposes it would be much nicer to obtain indexes into the string 
where the search term was found, (or alternatively, let me set separate 
callbacks for both the context and the search results) so that I could 
do my own processing with that information.

(I guess I could generate a unique string that doesn't yet appear in the 
string, set it as pre/post tag, and then split on that, but that feels 
very backwards).
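
For illustration, that workaround could be sketched as follows. This is 
plain Perl with hypothetical helper names; the marked-up string is 
constructed by hand here, where in practice it would be the excerpt 
produced after passing the marker to set_pre_tag() and set_post_tag():

```perl
use strict;
use warnings;

# Build a marker string that does not occur anywhere in $text.
sub unique_marker {
    my ($text) = @_;
    my $n = 0;
    $n++ while index($text, "\x{1}M$n\x{1}") >= 0;
    return "\x{1}M$n\x{1}";
}

# Split a string in which every match is wrapped in $marker. The pieces
# alternate: even indexes are context, odd indexes are matches.
sub split_marked {
    my ($marked, $marker) = @_;
    return split /\Q$marker\E/, $marked, -1;
}

my $text   = 'use threads wisely';
my $marker = unique_marker($text);

# Stand-in for highlighter output with the marker as both pre and post tag.
my $marked = 'use ' . $marker . 'threads' . $marker . ' wisely';

my @pieces = split_marked($marked, $marker);
for my $i (0 .. $#pieces) {
    printf "%s: '%s'\n", ($i % 2 ? 'match' : 'context'), $pieces[$i];
}
```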

Cheers,
Moritz

Re: [lucy-dev] Thank you, and site built with Lucy

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Jul 15, 2013 at 9:23 AM, Moritz Lenz <mo...@faui2k3.org> wrote:
> this might be slightly off-topic here, but still I'd like to thank all
> contributors for their amazing work on Lucy.
>
> Recently I've built a search feature for my web IRC logs with Lucy, and I
> found the (Perl-)API nice to use, and the error messages very helpful too.
> (Except that one time when I managed to make lucy + perl 5.10.1 segfault;
> sadly I didn't manage to reproduce it).
>
> So, please keep up the good work!

Thank you for the feedback -- I'm glad you've had a good experience!

> Some details:
>
> An example search can be found here:
> http://irclog.perlgeek.de/perl6/search/?nick=timtoady&q=threads
> The backend code is here:
> https://github.com/moritz/ilbot/blob/master/lib/Ilbot/Backend/Search.pm
>
> For the indexing I lump together all subsequent lines by the same nick into
> one document, and store the database IDs as a comma-separated value in a
> second field, and the day in a third field. Each IRC channel has a separate
> index.
>
> Then when displaying the search results, I retrieve all lines for that day
> and channel from the database or cache (which is fast enough, and much
> simpler than building complicated queries), and filter out the search
> results, plus a few lines before and afterward for context.
>
> So far I'm very happy with these tradeoffs, and like the results.

It's a very nice interface.  Congratulations on a successful design. :)

I wonder whether you might consider making the "line" field stored and
highlightable.

      my $type = Lucy::Plan::FullTextType->new(
          analyzer => $polyanalyzer,
-         stored => 0,
+         highlightable => 1,
      );

I see that you've emboldened the relevant line, but you could go further and
use the Highlighter to emphasize the keywords that were searched for.

    http://lucy.apache.org/docs/perl/Lucy/Docs/Tutorial/Highlighter.html
    http://lucy.apache.org/docs/perl/Lucy/Highlight/Highlighter.html

By default, the Highlighter surrounds keywords with `<strong>` tags, but using
set_pre_tag() and set_post_tag() you can make it use a span with CSS, <blink>
tags, or whatever.

The tradeoff is that your indexes would take up more space.

Marvin Humphrey