Posted to user@lucy.apache.org by Grant McLean <gr...@catalyst.net.nz> on 2011/07/11 05:28:23 UTC

[lucy-user] Indexing HTML documents

Hi all

I'm just getting started with trying out Lucy. Installation went without
a hitch and I've successfully worked my way through the tutorials.
Congratulations on getting the project to this level of quality.

My main interest is indexing HTML documents for web sites.  It seems
that if I feed the HTML file contents to the Lucy indexer, all the
markup (tags and attributes) ends up in the index and consequently comes
back out in the highlighted excerpts. Is it my responsibility to strip
the tags out before passing the text to the indexer? Or is there a
simple option I can enable somewhere to have this happen automatically?

Regards
Grant


Re: [lucy-user] Indexing HTML documents

Posted by Peter Karman <pe...@peknet.com>.
Grant McLean wrote on 7/10/11 10:28 PM:
> Hi all
> 
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.
> Congratulations on getting the project to this level of quality.
> 
> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer? Or is there a
> simple option I can enable somewhere to have this happen automatically?
> 

Consider using Swish3 with the Lucy backend.

http://search.cpan.org/dist/SWISH-Prog-Lucy/

If you install SWISH::Prog::Lucy you'll get the swish3 cli with which you can
easily index .html, .xml, .pdf, .doc, .xls, .txt, etc.

Example:

index docs:
 % swish3 -F lucy -i path/to/html/files

search docs:
 % swish3 -q 'some query'

Since the index created is a standard Lucy index, you can search it with the
relevant Lucy classes, or use the SWISH::Prog::Lucy::Searcher wrapper (which
automatically refreshes the index handle when the index is updated).
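
For example, searching that index directly with the core Lucy classes might look roughly like this (an untested sketch; the 'swishtitle' field name is a guess based on Swish conventions, so check your config for the actual field names):

    use Lucy::Search::IndexSearcher;

    # path is wherever swish3 wrote its Lucy index on your system
    my $searcher = Lucy::Search::IndexSearcher->new( index => 'index.swish' );
    my $hits     = $searcher->hits( query => 'some query', num_wanted => 10 );
    while ( my $hit = $hits->next ) {
        printf "%0.3f  %s\n", $hit->get_score, $hit->{swishtitle};
    }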

See also the new Dezi REST server if you want to put a web service in front of
your Lucy index, like Solr:

 http://search.cpan.org/dist/Dezi

Docs are still a bit sparse; get in touch if you're interested in helping flesh
them out.



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Indexing HTML documents

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jul 12, 2011 at 05:54:35PM +0200, Jens Krämer wrote:
> On 12.07.2011, at 09:39, arjan wrote:
> > What you could do to match words with and without accents is adding an
> > extra field for the content without accents. There are perl modules
> > available to replace accented characters. This is called "normalization
> > form d".
> 
> Wouldn't doing so break the highlighting of matching terms because the hit
> for 'cafe' then would occur in the normalized field, but not in the 'main'
> field that most probably would be used for showing the excerpt?

Yes, that's right.

> I don't know Lucy (yet ;-) but I've done lots of work with Lucene and
> Ferret, and there I usually normalize accented characters (and german
> umlauts) with a special token filter that's part of a custom analyzer.
 
This is technically possible, though Lucy's Analyzer subclassing API has been
temporarily redacted in anticipation of refactoring, and so is not officially
supported or documented.

  package NormDAnalyzer;
  use base qw( Lucy::Analysis::Analyzer );
  use Unicode::Normalize qw( normalize );

  sub transform {
    my ($self, $inversion) = @_;
    while (my $token = $inversion->next) {
      $token->set_text(normalize('D', $token->get_text)); 
    }
    $inversion->reset;
    return $inversion;
  }
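
Wiring that subclass into a schema would look something like the following -- again an unsupported, untested sketch, and the particular analyzer chain is just one plausible arrangement:

  use Lucy::Plan::Schema;
  use Lucy::Plan::FullTextType;
  use Lucy::Analysis::PolyAnalyzer;
  use Lucy::Analysis::CaseFolder;
  use Lucy::Analysis::RegexTokenizer;

  my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
      analyzers => [
          Lucy::Analysis::CaseFolder->new,
          Lucy::Analysis::RegexTokenizer->new,
          NormDAnalyzer->new,    # the subclass above
      ],
  );
  my $type   = Lucy::Plan::FullTextType->new(
      analyzer      => $analyzer,
      highlightable => 1,
  );
  my $schema = Lucy::Plan::Schema->new;
  $schema->spec_field( name => 'content', type => $type );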

> Imho treating 'é' like 'e' should be no harder than treating 'E' like 'e',
> so I'd say going where your tokens are being downcased and hooking in there
> to additionally perform more normalizations should be the way to go. But as
> I said, I have no idea if and how this is possible in Lucy...

I agree.  I'd just been putting off the Big Discussion about the Analysis
chain ;) ... and I'd somehow missed that highlighting was part of Grant's
requirements.

Good catch, Jens.

Marvin Humphrey


Re: [lucy-user] Indexing HTML documents

Posted by Jens Krämer <jk...@jkraemer.net>.
Hi!

On 12.07.2011, at 09:39, arjan wrote:
> 
> What you could do to match words with and without accents is adding an extra field for the content without accents. There are perl modules available to replace accented characters. This is called "normalization form d".

Wouldn't doing so break the highlighting of matching terms because the hit for 'cafe' then would occur in the normalized field, but not in the 'main' field that most probably would be used for showing the excerpt?

I don't know Lucy (yet ;-) but I've done lots of work with Lucene and Ferret, and there I usually normalize accented characters (and german umlauts) with a special token filter that's part of a custom analyzer.

Imho treating 'é' like 'e' should be no harder than treating 'E' like 'e', so I'd say going where your tokens are being downcased and hooking in there to additionally perform more normalizations should be the way to go. But as I said, I have no idea if and how this is possible in Lucy...


Cheers,
Jens

> On 12-07-11 07:28, Grant McLean wrote:
>> On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
>>> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
[..]
>> 
>> The final issue I'd like to tackle is the handling of accents.  Ideally
>> I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
>> should be able to type a query with-or-without the accent and match
>> documents with-or-without the accent and have the excerpt highlighting
>> pick up words with-or-without the accent.  I would prefer not to have
>> the search results and excerpts lacking accents if they are present in
>> the source document.  Is this dream scenario possible?  Perhaps with
>> synonyms?  Can anyone suggest an approach?
>> 
>> Thanks
>> Grant
>> 


--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/



Re: [lucy-user] Indexing HTML documents

Posted by arjan <ar...@unitedknowledge.nl>.
Hi Grant,

What you could do to match words with and without accents is adding an 
extra field for the content without accents. There are perl modules 
available to replace accented characters. This is called "normalization 
form d".
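
Something along these lines should work (untested sketch using Unicode::Normalize from core Perl):

    use utf8;
    use Unicode::Normalize qw( NFD );

    sub strip_accents {
        my ($text) = @_;
        my $decomposed = NFD($text);   # normalization form D: 'é' => 'e' + combining acute
        $decomposed =~ s/\p{Mn}//g;    # drop the combining (non-spacing) marks
        return $decomposed;
    }

    print strip_accents('café'), "\n";   # prints "cafe"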

Kind regards,
Arjan.

On 12-07-11 07:28, Grant McLean wrote:
> On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
>> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
>>> My main interest is indexing HTML documents for web sites.  It seems
>>> that if I feed the HTML file contents to the Lucy indexer, all the
>>> markup (tags and attributes) ends up in the index and consequently comes
>>> back out in the highlighted excerpts. Is it my responsibility to strip
>>> the tags out before passing the text to the indexer?
>> You have to handle document parsing yourself and supply plain text to Lucy.
> I was guessing that was probably the case.  I ended up using the
> HTML::Strip module from CPAN and apart from a strange encoding issue
> when it used HTML::Entities for entity expansion, it seems to have
> worked reasonably well.
>
> I'm now interested in tuning my setup for better quality search results.
> My current application cannot assume a sophisticated user base - they
> just want to bang a word or phrase into the search box and hit go.
>
> The first thing I did that improved the results was to rewrite a raw
> query string like this
>
>      Votes for Women
>
> into this:
>
>      (vote AND for AND women) OR ("votes for women")
>
> and pass the result to the query parser.
>
> Initially I found that doing a phrase search by wrapping double quotes
> around the words didn't seem to make any difference to the results.
> This seemed to be because the phrases I was using all contained
> stopwords and I had indexed using a PolyAnalyser with
> SnowballStopFilter.
>
> The next improvement I made was to index the document title field
> without using a stopword filter (I left the filter on for the document
> body) and also add a 'boost =>  5' to the type definition for the title
> field.
>
> This resulted in a more manageable number of hits and better ranking.
>
> I also tried building my own query objects and combining them with
> ORQuery and using 'boost' values for queries on important fields.  This
> exercise was largely fruitless.  In the cases where I managed to get any
> results at all, the effect of boost seemed to be exactly the opposite of
> what I expected - a larger boost led to a smaller score.  That might be
> because the bits of the query that matched weren't actually the ones I
> expected.  If this is a valid area for people to explore, it might be
> worth adding a working example or two to the documentation.
>
> So I now have a setup that works reasonably well and gives sensible
> rankings.
>
> The final issue I'd like to tackle is the handling of accents.  Ideally
> I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
> should be able to type a query with-or-without the accent and match
> documents with-or-without the accent and have the excerpt highlighting
> pick up words with-or-without the accent.  I would prefer not to have
> the search results and excerpts lacking accents if they are present in
> the source document.  Is this dream scenario possible?  Perhaps with
> synonyms?  Can anyone suggest an approach?
>
> Thanks
> Grant
>
>
>
>


-- 
Arjan
United Knowledge, internet for the public sector
Keizersgracht 74, 1015 CT Amsterdam
arjan@unitedknowledge.nl
http://www.unitedknowledge.nl


Re: [lucy-user] Indexing HTML documents

Posted by Grant McLean <gr...@catalyst.net.nz>.
On Tue, 2011-07-12 at 08:04 -0700, Marvin Humphrey wrote:
> On Tue, Jul 12, 2011 at 05:28:30PM +1200, Grant McLean wrote:
> > I was guessing that was probably the case.  I ended up using the
> > HTML::Strip module from CPAN and apart from a strange encoding issue
> > when it used HTML::Entities for entity expansion, it seems to have
> > worked reasonably well.
> 
> Did you get that encoding issue licked?  If not, can you reproduce it?

Yes, I worked around it.  I'm not sure I can really pin the blame on
HTML::Entities. The thing that threw me was it seems that as long as the
entities being expanded fit in the U+0080 - U+00FF range the return
value is a byte string rather than a UTF-8 character string.  As soon as
it expands an entity beyond U+00FF, the returned values are character
strings with the UTF-8 flag set.  I think it was this inconsistent
behaviour that exposed a bad assumption in another part of my code.
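
For anyone who hits the same thing, one way to smooth over the inconsistency is to force character-string semantics on the result either way (just an illustration, not necessarily the right fix for every situation):

    use HTML::Entities qw( decode_entities );

    my $latin1_only = decode_entities('caf&eacute;');           # may come back as a byte string
    my $beyond_ff   = decode_entities('caf&eacute; &hellip;');  # comes back with the UTF-8 flag set

    # reinterpret an unflagged byte string as native (Latin-1) characters;
    # this is a no-op if the string is already flagged
    utf8::upgrade($latin1_only);
    utf8::upgrade($beyond_ff);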

> > The first thing I did that improved the results was to rewrite a raw
> > query string like this
> > 
> >     Votes for Women
> > 
> > into this:
> > 
> >     (vote AND for AND women) OR ("votes for women")
> > 
> > and pass the result to the query parser.
> 
> You might also experiment with using LucyX::Search::ProximityQuery instead of
> PhraseQuery for the supplementary clause.

I would definitely be interested in doing that, but I'm not sure how to go
about it.  I built a query object like this:

    my $proximity_query = LucyX::Search::ProximityQuery->new( 
        field  => 'content',
        terms  => \@words,
        within => 10,    # match within 10 positions
    );

But $searcher->hits( query => $proximity_query ) never seems to return
me any matches at all no matter what words I feed it.  Does it need to
be combined with another query?


> You may find the following trick useful for inspecting queries produced by
> QueryParser:
> 
>     use Data::Dumper qw( Dumper );
>     my $query = $query_parser->parse($query_string);
>     warn Dumper($query->dump);

That's definitely very interesting.  I'll have a closer look at that
after a good night's sleep :-)

> > I also tried building my own query objects and combining them with
> > ORQuery and using 'boost' values for queries on important fields.  
> > ... If this is a valid area for people to explore, it
> > might be worth adding a working example or two to the documentation.
> 
> I like the idea, but I'm not sure exactly where to work this in.

I think further study of the dumped query will probably move me in the
right direction.

> > The final issue I'd like to tackle is the handling of accents.  Ideally
> > I'd like to be able to treat 'cafe' and 'café' as equivalent.
> 
> Arjan's suggestion is the way to go.  Thanks, Arjan!

What I've done is very similar to Arjan's suggestion.  I've actually
appended a normalised copy of the full text to the content field.  This
has the advantage that the highlighter can choose the part of the text
that includes or omits the accents as appropriate.
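
In sketch form (strip_accents() being the same NFD-then-strip-marks approach discussed earlier in the thread):

    use utf8;
    use Unicode::Normalize qw( NFD );

    sub strip_accents {
        my $t = NFD( $_[0] );
        $t =~ s/\p{Mn}//g;     # 'café' => 'cafe'
        return $t;
    }

    my $plain_text = 'Meet me at the café on Cuba Street.';
    my %doc = (
        title   => 'Example page',
        content => $plain_text . "\n" . strip_accents($plain_text),
    );
    # %doc then goes to $indexer->add_doc(\%doc) as usual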


One new area I'd like to explore is the idea of assigning a page rank to
each document that could be used as a multiplier in the ranking.  In the
collection of documents I'm working with, the following rules apply:

 * documents with many incoming links are 'better'
 * recent documents are better than ones with older publication dates
 * longer documents are somewhat 'better' than shorter ones (or to
   put it another way we have quite a few very short documents that are
   definitely 'worse')

I'll do some more reading and see if I can work out where these sorts of
rules might be applied.
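
For example, something along these lines, assuming add_doc() accepts a per-document 'boost' multiplier (I haven't checked that against the current docs, and the weights are placeholders rather than recommendations):

    sub document_boost {
        my ($meta) = @_;
        my $boost = 1.0;
        $boost += 0.1 * log( 1 + $meta->{incoming_links} );   # well-linked docs rank higher
        $boost += 0.5 if $meta->{published} ge '2011-01-01';  # favour recent documents
        $boost *= 0.7 if length( $meta->{content} ) < 500;    # damp very short documents
        return $boost;
    }

    # ... while indexing:
    # $indexer->add_doc( doc => \%doc, boost => document_boost(\%meta) );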

Thanks everyone for the useful suggestions.

Regards
Grant



Re: [lucy-user] Indexing HTML documents

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jul 12, 2011 at 05:28:30PM +1200, Grant McLean wrote:
> I was guessing that was probably the case.  I ended up using the
> HTML::Strip module from CPAN and apart from a strange encoding issue
> when it used HTML::Entities for entity expansion, it seems to have
> worked reasonably well.

Did you get that encoding issue licked?  If not, can you reproduce it?

I've debugged encoding issues with HTML::Entities before.  Older versions had
a lot of problems, but newer ones are much better behaved -- so you might try
upgrading if you haven't already.

> The first thing I did that improved the results was to rewrite a raw
> query string like this
> 
>     Votes for Women
> 
> into this:
> 
>     (vote AND for AND women) OR ("votes for women")
> 
> and pass the result to the query parser.

You might also experiment with using LucyX::Search::ProximityQuery instead of
PhraseQuery for the supplementary clause.

> Initially I found that doing a phrase search by wrapping double quotes
> around the words didn't seem to make any difference to the results.
> This seemed to be because the phrases I was using all contained
> stopwords and I had indexed using a PolyAnalyzer with
> SnowballStopFilter.

Using a SnowballStopFilter strips the stopwords out of the token array:

    "votes for women" => "votes women"

Stoplists have certain advantages, particularly in terms of shrinking index
size, but they can have detrimental effects on recall, particularly when you
need to search for something like '"The Smiths"' and your search returns
everything that contains 'smith'.  Also, Lucy's scoring model tends to
diminish the impact of common terms:

    http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/IRTheory.html#TF-IDF-ranking-algorithm

    ... in a search for skate park, documents which score well for the
    comparatively rare term 'skate' will rank higher than documents which score
    well for the more common term 'park'.

You may find the following trick useful for inspecting queries produced by
QueryParser:

    use Data::Dumper qw( Dumper );
    my $query = $query_parser->parse($query_string);
    warn Dumper($query->dump);

> I also tried building my own query objects and combining them with
> ORQuery and using 'boost' values for queries on important fields.  This
> exercise was largely fruitless.  In the cases where I managed to get any
> results at all, the effect of boost seemed to be exactly the opposite of
> what I expected - a larger boost led to a smaller score.

Scores are not absolute -- they are only meaningful as relative measures
within the context of a single search.

> That might be because the bits of the query that matched weren't actually
> the ones I expected.  If this is a valid area for people to explore, it
> might be worth adding a working example or two to the documentation.

I like the idea, but I'm not sure exactly where to work this in.  It doesn't
belong in the reference documentation for the individual classes.  Instead it
should go in application documentation under Lucy::Docs -- but I don't think
there's an appropriate article there yet.

Perhaps we could use an article on tuning scoring, Lucy::Docs::Tuning or
something like that.

> So I now have a setup that works reasonably well and gives sensible
> rankings.

:) 

> The final issue I'd like to tackle is the handling of accents.  Ideally
> I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
> should be able to type a query with-or-without the accent and match
> documents with-or-without the accent and have the excerpt highlighting
> pick up words with-or-without the accent.  I would prefer not to have
> the search results and excerpts lacking accents if they are present in
> the source document.  Is this dream scenario possible?  Perhaps with
> synonyms?  Can anyone suggest an approach?

Arjan's suggestion is the way to go.  Thanks, Arjan!

Marvin Humphrey


Re: [lucy-user] Indexing HTML documents

Posted by Grant McLean <gr...@catalyst.net.nz>.
On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> > My main interest is indexing HTML documents for web sites.  It seems
> > that if I feed the HTML file contents to the Lucy indexer, all the
> > markup (tags and attributes) ends up in the index and consequently comes
> > back out in the highlighted excerpts. Is it my responsibility to strip
> > the tags out before passing the text to the indexer?
> 
> You have to handle document parsing yourself and supply plain text to Lucy.

I was guessing that was probably the case.  I ended up using the
HTML::Strip module from CPAN and apart from a strange encoding issue
when it used HTML::Entities for entity expansion, it seems to have
worked reasonably well.
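
For reference, the HTML::Strip part is only a few lines (it hands entity expansion off to HTML::Entities, which is where the encoding wrinkle came from):

    use HTML::Strip;

    my $html = '<p>Votes for <b>Women</b> &ndash; a short history</p>';
    my $hs   = HTML::Strip->new;
    my $text = $hs->parse($html);   # plain text with tags removed, entities expanded
    $hs->eof;                       # reset parser state before reusing the object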

I'm now interested in tuning my setup for better quality search results.
My current application cannot assume a sophisticated user base - they
just want to bang a word or phrase into the search box and hit go.

The first thing I did that improved the results was to rewrite a raw
query string like this

    Votes for Women

into this:

    (vote AND for AND women) OR ("votes for women")

and pass the result to the query parser.
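
The rewrite itself is just string munging before the string goes to the query parser; roughly this, with $schema and $searcher set up as in the tutorial, and the analyzer chain taking care of downcasing and stemming at parse time:

    use Lucy::Search::QueryParser;

    sub rewrite_query {
        my ($raw) = @_;
        my @words = grep { length } split /\s+/, $raw;
        return '(' . join( ' AND ', @words ) . ') OR ("' . $raw . '")';
    }

    my $query_string = rewrite_query('Votes for Women');
    # => (Votes AND for AND Women) OR ("Votes for Women")

    # my $parser = Lucy::Search::QueryParser->new( schema => $schema );
    # my $hits   = $searcher->hits( query => $parser->parse($query_string) );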

Initially I found that doing a phrase search by wrapping double quotes
around the words didn't seem to make any difference to the results.
This seemed to be because the phrases I was using all contained
stopwords and I had indexed using a PolyAnalyzer with
SnowballStopFilter.

The next improvement I made was to index the document title field
without using a stopword filter (I left the filter on for the document
body) and also add a 'boost => 5' to the type definition for the title
field.

This resulted in a more manageable number of hits and better ranking.
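
A sketch of that kind of schema (class names and parameters as I read the Lucy docs, so double-check against the version you have):

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Analysis::CaseFolder;
    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Analysis::SnowballStopFilter;
    use Lucy::Analysis::SnowballStemmer;

    my $schema = Lucy::Plan::Schema->new;

    # title: no stopword filter, boosted
    my $title_type = Lucy::Plan::FullTextType->new(
        analyzer => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
        boost    => 5,
    );

    # body: full chain including the stopword filter
    my $content_type = Lucy::Plan::FullTextType->new(
        analyzer => Lucy::Analysis::PolyAnalyzer->new(
            analyzers => [
                Lucy::Analysis::CaseFolder->new,
                Lucy::Analysis::RegexTokenizer->new,
                Lucy::Analysis::SnowballStopFilter->new( language => 'en' ),
                Lucy::Analysis::SnowballStemmer->new( language => 'en' ),
            ],
        ),
        highlightable => 1,
    );

    $schema->spec_field( name => 'title',   type => $title_type );
    $schema->spec_field( name => 'content', type => $content_type );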

I also tried building my own query objects and combining them with
ORQuery and using 'boost' values for queries on important fields.  This
exercise was largely fruitless.  In the cases where I managed to get any
results at all, the effect of boost seemed to be exactly the opposite of
what I expected - a larger boost led to a smaller score.  That might be
because the bits of the query that matched weren't actually the ones I
expected.  If this is a valid area for people to explore, it might be
worth adding a working example or two to the documentation.

So I now have a setup that works reasonably well and gives sensible
rankings.

The final issue I'd like to tackle is the handling of accents.  Ideally
I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
should be able to type a query with-or-without the accent and match
documents with-or-without the accent and have the excerpt highlighting
pick up words with-or-without the accent.  I would prefer not to have
the search results and excerpts lacking accents if they are present in
the source document.  Is this dream scenario possible?  Perhaps with
synonyms?  Can anyone suggest an approach?

Thanks
Grant





Re: [lucy-user] Indexing HTML documents

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.

Nice...

> Congratulations on getting the project to this level of quality.

Thanks!  :)

> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer?

You have to handle document parsing yourself and supply plain text to Lucy.

Lucy is a specialized fulltext indexing library rather than a turnkey indexing
solution, so it does not bundle file-format-specific parsing tools.  Instead,
it is designed so that it may serve as the indexing component within a larger
system which aggregates additional components such as parsers.
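
For anyone reading along later, the hand-off of plain text is just the standard tutorial pattern -- a sketch, with paths and field names as placeholders:

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Index::Indexer;

    my $schema   = Lucy::Plan::Schema->new;
    my $analyzer = Lucy::Analysis::PolyAnalyzer->new( language => 'en' );
    my $type     = Lucy::Plan::FullTextType->new( analyzer => $analyzer );
    $schema->spec_field( name => 'title',   type => $type );
    $schema->spec_field( name => 'content', type => $type );

    my $indexer = Lucy::Index::Indexer->new(
        schema => $schema,
        index  => '/path/to/index',
        create => 1,
    );

    # the HTML-to-plain-text step happens before this point
    my $title = 'Votes for Women';
    my $text  = 'Plain text extracted from the HTML document...';
    $indexer->add_doc( { title => $title, content => $text } );
    $indexer->commit;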

At this point I would ordinarily suggest a variety of HTML parsing CPAN
distributions, but presuming that you are the Grant McLean who maintains
XML::Simple and XML::SAX, I imagine that you are familiar with the lay of the
land.  :)

Marvin Humphrey

