Posted to dev@lucy.apache.org by Moritz Lenz <mo...@faui2k3.org> on 2011/02/20 19:01:46 UTC

[lucy-dev] Schema for searching IRC logs

(Originally I sent this mail to the kinosearch mailing list, but since
it's temporarily down, Marvin suggested I send it to lucy-dev instead.
Please excuse me if it's not quite on topic here.)

Hi all,

I've been running public IRC logs for a few years now, and have decided
to replace the crappy search with something decent. So, KinoSearch it is :-)

One page of these logs contains the conversation from one channel at one
particular day, and each such page contains many rows consisting of an
ID, a timestamp, a nickname, and the line that was being uttered.
Example: http://irclog.perlgeek.de/perl6/2011-02-19. (Currently I have
about 20 channels, a few years' worth of logs and 4 million rows; I want
to be able to scale up to maybe 20 million rows)

I want my search results to be grouped similarly, so my current schema
looks like this:

my $schema      = KinoSearch::Plan::Schema->new;
my $poly_an     = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');
my $full_text   = KinoSearch::Plan::FullTextType->new(
                    analyzer => $poly_an,
                    stored   => 0,
                  );
my $string      = KinoSearch::Plan::StringType->new( stored => 0 );
my $kept_string = KinoSearch::Plan::StringType->new(
                    stored   => 1,
                    sortable => 1,
                  );
my $sort_string = KinoSearch::Plan::StringType->new(
                    stored   => 0,
                    sortable => 1,
                  );

$schema->spec_field(name => 'line',     type => $full_text);
$schema->spec_field(name => 'nick',     type => $string);
$schema->spec_field(name => 'channel',  type => $kept_string);
$schema->spec_field(name => 'day',      type => $kept_string);
$schema->spec_field(name => 'timestamp', type => $sort_string);
$schema->spec_field(name => 'id',       type => $kept_string);
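
Each row then goes into the index as its own document -- roughly like this
(just a sketch; the real code pulls the rows out of the database):

my $indexer = KinoSearch::Index::Indexer->new(
    schema => $schema,
    index  => '/path/to/index',
    create => 1,
);
for my $row (@rows) {   # one IRC line per document
    $indexer->add_doc({
        line      => $row->{line},
        nick      => $row->{nick},
        channel   => $row->{channel},
        day       => $row->{day},
        timestamp => $row->{timestamp},
        id        => $row->{id},
    });
}
$indexer->commit;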

Having each line as a separate document has three disadvantages:

1) when displaying the results, I have to construct the context manually
(so I need to hit the DB to get the rows before and after, which is why
I don't store the line in the index)

2) when paging the search results, I rip apart the last page, because
the num_wanted option works with rows, not pages.

3) I'm not sure about this one, but it feels like this solution doesn't
scale well. I've waited more than half a minute for a query that was
limited to 100 rows. (Maybe my three sort_specs hurt here?)

Is there a way to construct my schema in a way to avoid these problems
(and still allows searching by field)? Something like sub-documents,
where I have pages as top level documents, and each page can have
multiple rows?

Cheers,
Moritz

Re: [lucy-dev] Schema for searching IRC logs

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Moritz,

On Feb 27, 2011, at 2:06 AM, Moritz Lenz wrote:
> 
> In the mean time I'd like to thank everybody for their helpful input and
> for the work on KinoSearch/lucy.

No, THANK YOU for actually using it and finding value in it. We appreciate that!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [lucy-dev] Schema for searching IRC logs

Posted by Moritz Lenz <mo...@faui2k3.org>.
On 02/21/2011 12:53 AM, Marvin Humphrey wrote:
> On Sun, Feb 20, 2011 at 07:01:46PM +0100, Moritz Lenz wrote:
>> Having each line as a separate document has three disadvantages:
>> 
>> 1) when displaying the results, I have to construct the context manually
>> (so I need to hit the DB to get the rows before and after, which is why
>> I don't store the line in the index)
> 
> You could theoretically store the entire page with each line.  However, that
> would waste space thanks to the redundancy, and so it's probably better to
> store the pages in a separate data structure (RDBMS, Berkeley DB, etc), and
> retrieve them after getting the results back from the index.
> 
> Alternately, consider providing less context: just the lines before and after.

Makes sense, thank you.

>> 2) when paging the search results, I rip apart the last page, because
>> the num_wanted option works with rows, not pages.
> 
> I don't quite grok what you mean.  However, I can certainly see how there
> would be difficulties if you want to display results broken up by "page", but
> your engine returns results broken up by "line": you'll have to post-process
> the line-based hits to collate the page-based results.  That gets very
> complicated as soon as you go past the first SERP.  (SERP = "Search Engine
> Results Page", distinct from how you're using the word "page" to describe IRC
> log content).

That's exactly what I meant; I just didn't describe it well enough.

>> 3) I'm not sure about this one, but it feels like this solution doesn't
>> scale well. I've waited more than half a minute for a query that was
>> limited to 100 rows. 
> 
> The two fundamentals when optimizing for search speed are RAM and posting list
> size.
> 
> First, you need enough RAM on the box to fit all of the important index
> components (lexicons, posting lists, and sort caches) into the OS cache.  With
> millions of records, you are really going to feel it if you are hitting the
> hard disk.
> 
> Second, slow queries are almost always slow because some part of the query
> matches a very large number of documents -- or to put it into our native
> terminology, at least one term has a very large posting list. 

That is certainly the case with my slow queries. In fact, one example I
remember was two AND-connected terms that each produced a large number of
matches when run separately.

>> (Maybe my three sort_specs hurt here?)
> 
> Almost certainly not.  Search-time sorting in Lucy/KinoSearch is very fast; we
> spend a fair amount of effort building up optimized data structures to support
> sorting at index-time.  Thanks to that approach, if anything, the costs of
> making additional fields sortable are felt at index-time, not search-time.

Great.

>> Is there a way to construct my schema in a way to avoid these problems
>> (and still allows searching by field)? Something like sub-documents,
>> where I have pages as top level documents, and each page can have
>> multiple rows?
> 
> If I understand correctly, there seem to be inherent difficulties with the
> one-to-many relationships in that approach.
> 
> If you organize documents by "page", and each "page" has multiple values for
> the 'nick' field, you are going to get false positives when filtering by
> 'nick'.  For instance, if both "chromatic" and "moritz" have authored lines on
> a given page, then a filter on "moritz" will fail to exclude nearby content
> authored by "chromatic".
> 
> Similarly, if you organize documents by page, each now has multiple
> 'timestamp' values.  How do you know which line within the page caused the
> hit, and thus which associated timestamp the result should sort by?
> 
> I think the only way to achieve the ideal logical result you've described is
> to organize the index with "line" as the top-level document.  However, there
> is no question that organizing by "page" would drastically cut down the size
> of the posting lists which are being iterated, improving search speed.
> 
> Would it be acceptable to modify the spec?
> 
>   * Pages are top-level documents.
>   * Each page is associated with one timestamp -- the time of the first line.
>   * No page can cross multiple days.
>   * Pages can have multiple values for 'nick', so filtering on 'nick' limits
>     results to pages that an author has participated in, rather than lines
>     they've written.

I fear that this would make the search too unspecific. I often use the search
when I have a vague memory like "didn't TimToady say something about the
problem with submethod BUILD?", but since he's active almost every day,
narrowing down the search by nickname would lose nearly all of its value.

I'll continue pondering the problem and trying out things, maybe I'll
find a better solution.

In the mean time I'd like to thank everybody for their helpful input and
for the work on KinoSearch/lucy.

Cheers,
Moritz

Re: [lucy-dev] Schema for searching IRC logs

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Feb 20, 2011 at 07:01:46PM +0100, Moritz Lenz wrote:
> Having each line as a separate document has three disadvantages:
> 
> 1) when displaying the results, I have to construct the context manually
> (so I need to hit the DB to get the rows before and after, which is why
> I don't store the line in the index)

You could theoretically store the entire page with each line.  However, that
would waste space thanks to the redundancy, and so it's probably better to
store the pages in a separate data structure (RDBMS, Berkeley DB, etc), and
retrieve them after getting the results back from the index.

Alternately, consider providing less context: just the lines before and after.
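
For instance, pulling a couple of lines of context out of an RDBMS for a given
hit might look something like this (untested sketch -- the `irclog` table and
its columns are hypothetical, and $hit is one of your search results):

    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=irclog', 'user', 'password' );
    my $context = $dbh->selectall_arrayref(
        'SELECT id, nick, line FROM irclog
          WHERE channel = ? AND day = ? AND id BETWEEN ? AND ?
          ORDER BY id',
        { Slice => {} },    # return each row as a hashref
        $hit->{channel}, $hit->{day}, $hit->{id} - 2, $hit->{id} + 2,
    );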

> 2) when paging the search results, I rip apart the last page, because
> the num_wanted option works with rows, not pages.

I don't quite grok what you mean.  However, I can certainly see how there
would be difficulties if you want to display results broken up by "page", but
your engine returns results broken up by "line": you'll have to post-process
the line-based hits to collate the page-based results.  That gets very
complicated as soon as you go past the first SERP.  (SERP = "Search Engine
Results Page", distinct from how you're using the word "page" to describe IRC
log content).
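
Very roughly, the collation for the first SERP might look like this (sketch; it
relies on 'channel', 'day', and 'id' being stored fields, as in your schema,
and on $searcher/$query being your IndexSearcher and parsed query):

    my $hits = $searcher->hits( query => $query, num_wanted => 100 );
    my ( %lines_for_page, @page_order );
    while ( my $hit = $hits->next ) {
        my $page_key = "$hit->{channel}/$hit->{day}";    # one IRC log page
        push @page_order, $page_key unless exists $lines_for_page{$page_key};
        push @{ $lines_for_page{$page_key} }, $hit->{id};
    }
    # @page_order lists pages in order of their best-scoring line;
    # %lines_for_page maps each page to the ids of its matching lines.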

> 3) I'm not sure about this one, but it feels like this solution doesn't
> scale well. I've waited more than half a minute for a query that was
> limited to 100 rows. 

The two fundamentals when optimizing for search speed are RAM and posting list
size.

First, you need enough RAM on the box to fit all of the important index
components (lexicons, posting lists, and sort caches) into the OS cache.  With
millions of records, you are really going to feel it if you are hitting the
hard disk.

Second, slow queries are almost always slow because some part of the query
matches a very large number of documents -- or to put it into our native
terminology, at least one term has a very large posting list.  Even if the
complete query doesn't match very many documents, it's possible that a
sub-section of the query is slowing things down.  Thus, the process of query
optimization generally involves finding ways to match fewer documents.

Try running this code and see if anything stands out as likely to produce a
large result set:

    use Data::Dumper;
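    # ($query_parser below: your existing KinoSearch::Search::QueryParser)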
    my $query = $query_parser->parse($query_string);
    warn Dumper($query->dump);

> (Maybe my three sort_specs hurt here?)

Almost certainly not.  Search-time sorting in Lucy/KinoSearch is very fast; we
spend a fair amount of effort building up optimized data structures to support
sorting at index-time.  Thanks to that approach, if anything, the costs of
making additional fields sortable are felt at index-time, not search-time.
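
(For reference, a search-time sort on your 'timestamp' field would look
something like this -- untested sketch:)

    my $sort_spec = KinoSearch::Search::SortSpec->new(
        rules => [
            KinoSearch::Search::SortRule->new( field => 'timestamp' ),
            KinoSearch::Search::SortRule->new( type  => 'doc_id' ),
        ],
    );
    my $hits = $searcher->hits(
        query      => $query,
        num_wanted => 100,
        sort_spec  => $sort_spec,
    );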

> Is there a way to construct my schema in a way to avoid these problems
> (and still allows searching by field)? Something like sub-documents,
> where I have pages as top level documents, and each page can have
> multiple rows?

If I understand correctly, there seem to be inherent difficulties with the
one-to-many relationships in that approach.

If you organize documents by "page", and each "page" has multiple values for
the 'nick' field, you are going to get false positives when filtering by
'nick'.  For instance, if both "chromatic" and "moritz" have authored lines on
a given page, then a filter on "moritz" will fail to exclude nearby content
authored by "chromatic".

Similarly, if you organize documents by page, each now has multiple
'timestamp' values.  How do you know which line within the page caused the
hit, and thus which associated timestamp the result should sort by?

I think the only way to achieve the ideal logical result you've described is
to organize the index with "line" as the top-level document.  However, there
is no question that organizing by "page" would drastically cut down the size
of the posting lists which are being iterated, improving search speed.

Would it be acceptable to modify the spec?

  * Pages are top-level documents.
  * Each page is associated with one timestamp -- the time of the first line.
  * No page can cross multiple days.
  * Pages can have multiple values for 'nick', so filtering on 'nick' limits
    results to pages that an author has participated in, rather than lines
    they've written.

Marvin Humphrey


Re: [lucy-dev] Schema for searching IRC logs

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Feb 20, 2011 at 10:46:33PM +0100, Moritz Lenz wrote:
> Is the kinosearch list still in use then? If yes, what for?

Once we get an initial release of Lucy out, none of this will matter.  When
Lucy entered the Incubator, we'd hoped to make a release quickly enough that
it wouldn't be necessary to plan for the closure of KS resources prior to the
existence of an official Lucy tarball -- we'd release Lucy, then deal with
deprecating KS after that.  To be frank, I didn't think it was going to take
this long to resolve all of the dependency licensing issues that are currently
holding us up, but lots of commits are going in and we'll get there.

> > As for your issues below, why not aggregate all lines with a particular
> > user (and set of timestamps) into a single Document with multi-valued
> > fields for timestamp and for line? Would that help?
> 
> I haven't come across multi-valued fields yet. Where are they documented?

We don't have "multi-valued fields" -- that's a Lucene-only thing.  I strongly
dislike that quirk of Lucene and consider it a deceptive misfeature.  For
example, it takes a while for users to realize that you don't get multi-valued
sorting with Lucene's multi-valued fields (the first term is used, even if it
wasn't the term that matched).

You can fake up something like a multi-valued field in Lucy/KS using custom
tokenization:

  my $pipe_splitter = Lucy::Analysis::Tokenizer->new(pattern => '[^|]+');
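  # each run of non-pipe characters becomes its own token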
  my $field_type = Lucy::Plan::FullTextType->new(analyzer => $pipe_splitter);
  $schema->spec_field(name => 'nick', type => $field_type);
  ...
  $doc->{nick} = join('|', @nicks);

Note that if you make that "nick" field sortable, Lucy/KS will use the
*entire* string field value to determine sort order.  In other words, if the
value for a document is "chromatic|moritz" and the term "moritz" matches, it
will still be sorted by chromatic's nick first.

Marvin Humphrey


Re: [lucy-dev] Schema for searching IRC logs

Posted by Moritz Lenz <mo...@faui2k3.org>.
Hello Chris,

thanks for your swift reply.

On 02/20/2011 07:33 PM, Mattmann, Chris A (388J) wrote:

> Thanks for your email. In general, I would suggest that Lucy *is* the place you should come to for KinoSearch support, since Apache Lucy is now where the developers of KinoSearch are, and Apache Lucy is what KinoSearch has evolved into.

Works for me.
Is the kinosearch list still in use then? If yes, what for?

> As for your issues below, why not aggregate all lines with a particular user (and set of timestamps) into a single Document with multi-valued fields for timestamp and for line? Would that help?

I haven't come across multi-valued fields yet. Where are they documented?

Also, if I put all lines from one user into a Document, I still have to
reconstruct the context manually (that's not too bad, but not optimal
either). And will I be able to retrieve the ID of a found line somehow?

Cheers,
Moritz


Re: [lucy-dev] Schema for searching IRC logs

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Moritz,

Thanks for your email. In general, I would suggest that Lucy *is* the place you should come to for KinoSearch support, since Apache Lucy is now where the developers of KinoSearch are, and Apache Lucy is what KinoSearch has evolved into.

As for your issues below, why not aggregate all lines with a particular user (and set of timestamps) into a single Document with multi-valued fields for timestamp and for line? Would that help?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [lucy-dev] Schema for searching IRC logs

Posted by Peter Karman <pe...@peknet.com>.
Moritz Lenz wrote on 2/20/11 12:01 PM:

> 
> One page of these logs contains the conversation from one channel at one
> particular day, and each such page contains many rows consisting of an
> ID, a timestamp, a nickname, and the line that was being uttered.
> Example: http://irclog.perlgeek.de/perl6/2011-02-19. (Currently I have
> about 20 channels, a few years' worth of logs and 4 million rows; I want
> to be able to scale up to maybe 20 million rows)
> 
> I want my search results to be grouped similarly, so my current schema
> looks like this:

When I've done similar projects, I eventually ask myself: what is the smallest
unit I want to represent as a "result"? In this case, is it actually the row, or
the page of rows? I.e., start from the visual idea you want and work backwards.
It seems like you are doing that (you want to group your results similarly) --
but what does "similar" mean? Same page? Same channel? Etc.

One approach I have taken is to build multiple indexes, each with a different
unit of granularity, e.g. a page-level index and a row-level index. My search
code first queries the row-level index for its field specificity, and then
pulls the displayed results from the page-level index in order to get the
context. It's like hitting the db (as you mention), but usually faster, because
the de-normalizing of the rows has already taken place at index build-time.
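
Roughly, in plain KinoSearch terms, the two-index lookup is something like this
(sketch; the 'page_id' join key is hypothetical, and SWISH::Prog::KSx wraps the
details differently):

my $row_searcher  = KinoSearch::Search::IndexSearcher->new( index => 'rows.index' );
my $page_searcher = KinoSearch::Search::IndexSearcher->new( index => 'pages.index' );

my $row_hits = $row_searcher->hits( query => $query, num_wanted => 10 );
while ( my $row = $row_hits->next ) {
    # look up the enclosing page in the coarse-grained index
    my $page_query = KinoSearch::Search::TermQuery->new(
        field => 'page_id',    # hypothetical join key stored in both indexes
        term  => $row->{page_id},
    );
    my $page = $page_searcher->hits( query => $page_query, num_wanted => 1 )->next;
    # ... render $page with the matching row highlighted ...
}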


> 
> Is there a way to construct my schema in a way to avoid these problems
> (and still allows searching by field)? Something like sub-documents,
> where I have pages as top level documents, and each page can have
> multiple rows?
> 

I de-normalize my db to XML files, using XInclude (<xi:include>) elements to
represent the one-to-many relationships. So in this case, I would create a
page.xml:

 <page xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>....</title>
  <rows>
   <xi:include href="path/to/row1.xml" />
   ...
  </rows>
 </page>

and then create 2 indexes, one pointed at the page xml and one at the row xml.
(This is with SWISH::Prog::KSx and swish3.)

You wouldn't even have to have two indexes, if you didn't ever want to return
row-level results specifically.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com