Posted to solr-user@lucene.apache.org by Terence Gannon <bu...@gmail.com> on 2009/01/12 17:00:31 UTC

Improving Readability of Hit Highlighting

I'm indexing text from an OCR of an old document.  Many words get read
perfectly, but they're typically embedded in a lot of junk.  I would
like the hit highlighting to show only the 'good' words, in the order
in which they appeared in the original document.  Is it possible to
use output of the filter classes as the text used in hit highlighting?
Or do you have to do all the text cleanup outside of Solr and present it
with two fields to index, one with the original text, and one with the
cleaned-up text?  The objective of the hit highlighting is to give the
user a *sense* of the original context, even if it's not provided
verbatim from the original document.  Thanks in advance.

TerryG
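
A minimal Python sketch of the two-field approach raised in the question above. The field names `ocr_raw` and `ocr_clean`, the helpers `build_doc` and `highlight_params`, and the placeholder cleanup rule are all assumptions made for illustration; only `hl` and `hl.fl` are standard Solr highlighting parameters. The sketch relies on Solr highlighting the stored text of a field rather than the output of its analysis chain, which is why the cleaned text is supplied as its own field.

```python
from typing import Callable, Dict


def placeholder_clean(raw_text: str) -> str:
    """Placeholder rule (assumption): keep purely alphabetic tokens, lowercased."""
    return " ".join(t.lower() for t in raw_text.split() if t.isalpha())


def build_doc(doc_id: str, raw_text: str,
              clean: Callable[[str], str] = placeholder_clean) -> Dict[str, str]:
    """Build one document carrying both the raw OCR text and a cleaned copy.

    `ocr_raw` and `ocr_clean` are hypothetical field names; both would need
    to be declared in the schema, with `ocr_clean` stored so it can be
    returned in highlight snippets.
    """
    return {
        "id": doc_id,
        "ocr_raw": raw_text,
        "ocr_clean": clean(raw_text),
    }


def highlight_params(term: str) -> Dict[str, str]:
    """Request parameters asking for highlights from the cleaned field only."""
    return {
        "q": f"ocr_clean:{term}",
        "hl": "true",          # turn highlighting on
        "hl.fl": "ocr_clean",  # field(s) to build snippets from
    }


if __name__ == "__main__":
    raw = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
    print(build_doc("page-1", raw))
    print(highlight_params("tour"))
```

The placeholder rule only drops non-alphabetic tokens, so run-together junk words would still survive it; a stricter rule can be passed in through the `clean` argument.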

Re: Improving Readability of Hit Highlighting

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,

Quick note: please include a copy of the previous email when replying, so people can be reminded of the context.

You mentioned junk getting highlighted.  In your case, is CONTRACTORINMPRIMENTAYIVE getting highlighted, and is that junk?  If so, and if you have rules for what constitutes a junk token (e.g. the token is not in a dictionary), why not augment your indexing to throw such tokens out?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Terence Gannon <bu...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, January 12, 2009 4:07:57 PM
> Subject: Re: Improving Readability of Hit Highlighting
> 
> To answer your questions specifically, here is an example of the raw OCR output;
> 
> "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
> 
> to which I would like to see;
> 
> "mom ale access tour sheet to"
> 
> in the hit highlight.  My schema for this field is pretty much
> standard, as follows;
> 
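> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.StopFilterFactory" ...
> <filter class="solr.WordDelimiterFilterFactory" ...
> <filter class="solr.LowerCaseFilterFactory" ...
> <filter class="solr.EnglishPorterFilterFactory" ...
> <filter class="solr.RemoveDuplicatesTokenFilterFactory" ...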
> 
> When I examine the effect of each of these with the Analyzer, it seems
> like if I could use the output after LowerCaseFilterFactory in the hit
> highlight, I would come close to achieving what I want.
> 
> I'm not averse to doing the text cleanup external to Solr before the
> indexing, but only if it's *not* redundant to what the filter
> factories are going to do anyway.  Thanks for your help!
> 
> TerryG
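
The dictionary rule suggested in this reply (drop any token that is not a known word) can be sketched in Python as a pre-indexing step. This is an illustration under stated assumptions, not a Solr filter: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real word list, and `drop_junk_tokens` is a made-up helper name.

```python
import string
from typing import Set

# Hypothetical stand-in for a real dictionary or word list (assumption).
KNOWN_WORDS: Set[str] = {"mom", "ale", "accept", "tour", "sheet", "to"}


def drop_junk_tokens(raw_text: str, dictionary: Set[str] = KNOWN_WORDS) -> str:
    """Keep only tokens found in `dictionary`, lowercased, in original order."""
    kept = []
    for token in raw_text.split():
        # Strip surrounding punctuation so "sheet," would still match "sheet".
        word = token.strip(string.punctuation).lower()
        if word in dictionary:
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    raw = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
    print(drop_junk_tokens(raw))  # prints: mom ale accept tour sheet to
```

With a real word list the result depends entirely on its coverage: rare but legitimate words (names, technical terms) would be dropped along with the junk, so the list would likely need domain-specific additions.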

Re: Improving Readability of Hit Highlighting

Posted by Terence Gannon <bu...@gmail.com>.

To answer your questions specifically, here is an example of the raw OCR output;

"CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"

to which I would like to see;

"mom ale access tour sheet to"

in the hit highlight.  My schema for this field is pretty much
standard, as follows;

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ...
<filter class="solr.WordDelimiterFilterFactory" ...
<filter class="solr.LowerCaseFilterFactory" ...
<filter class="solr.EnglishPorterFilterFactory" ...
<filter class="solr.RemoveDuplicatesTokenFilterFactory" ...

When I examine the effect of each of these with the Analyzer, it seems
like if I could use the output after LowerCaseFilterFactory in the hit
highlight, I would come close to achieving what I want.

I'm not averse to doing the text cleanup external to Solr before the
indexing, but only if it's *not* redundant to what the filter
factories are going to do anyway.  Thanks for your help!

TerryG

Re: Improving Readability of Hit Highlighting

Posted by Otis Gospodnetic <ot...@yahoo.com>.

I'm not sure if I have a good suggestion, but I have a question. :)  What is considered "junk"?  Would it be possible to eliminate the junk before it even goes into the index in order to avoid GIGO (Garbage In Garbage Out)?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Terence Gannon <bu...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, January 12, 2009 11:00:31 AM
> Subject: Improving Readability of Hit Highlighting
> 
> I'm indexing text from an OCR of an old document.  Many words get read
> perfectly, but they're typically embedded in a lot of junk.  I would
> like the hit highlighting to show only the 'good' words, in the order
> in which they appeared in the original document.  Is it possible to
> use output of the filter classes as the text used in hit highlighting?
> Or do you have to do all the text cleanup outside of Solr and present it
> with two fields to index, one with the original text, and one with the
> cleaned-up text?  The objective of the hit highlighting is to give the
> user a *sense* of the original context, even if it's not provided
> verbatim from the original document.  Thanks in advance.
> 
> TerryG