Posted to solr-user@lucene.apache.org by "Owens, Martin" <Ma...@merrillcorp.com> on 2007/11/30 22:02:34 UTC

Solr Highlighting, word index

Hello everyone,

We're working to replace the old Linux version of dtSearch with Lucene/Solr, using HTTP requests from our Perl side and Java for the indexing.

The functionality that is causing the most problems is the highlighting, since we're not storing the text in Solr (only indexing it) and we need to highlight an image file (OCR). What we really need is to request from Solr the word indexes of the matches; we then tie these up to the OCR image and create HTML boxes to do the highlighting.

The text is also multi-page, with each page separated by Ctrl-L page breaks. Should we handle the paging ourselves, or can Solr tell us which page the match happened on too?

Thanks for your help,

Best Regards, Martin Owens

Re: Solr, Multiple processes running

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Solr, Multiple processes running
: References: <F7...@EVS02.adminsys.mrll.com>
:      <1E...@gmail.com>
:     <F7...@EVS02.adminsys.mrll.com>
	...

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss


RE: Solr, Multiple processes running

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.
>> When replying, the number of documents is almost, but not quite
>> totally, useless unless combined with the number of fields you're
>> storing per doc, the average length of each field, etc <G>.....

2 million documents, 3 fields (1 content, 1 unique, 1 object id) per index

The content field is between 200k and 2MB of text.

Just to fill in any gaps.

Best Regards, Martin Owens

Re: Solr, Multiple processes running

Posted by Erick Erickson <er...@gmail.com>.
How much data are we talking about here? Because it seems *much* simpler
to just index a field with each document indicating the user and then just
AND that user's ID in with your query.

Or think about facets (although I admit I don't know enough about facets
to weigh in on its merits, it's just been mentioned a lot).

Keeping track of 1,000+ indexes seems like a maintenance headache, but
much depends upon how much data you're talking about.

When replying, the number of documents is almost, but not quite
totally, useless unless combined with the number of fields you're
storing per doc, the average length of each field, etc <G>.....

Erick

On Dec 11, 2007 4:01 PM, Owens, Martin <Ma...@merrillcorp.com> wrote:

> Hello everyone,
>
> The system we're moving from (dtSearch) allows each of our clients to have
> a search index. So far I have yet to find the options required to set this,
> it seems I can only set the directory path before run time.
>
> Each of the indexes uses the same schema, same configuration just
> different data in each; what kind of performance penalty would I have from
> running a new solr instance per required database? what is the best way to
> track what port or what index is being used? would I be able to run 1,000 or
> more solr instances without performance degradation?
>
> Thanks for your help.
>
> Best regards, Martin Owens
>

Re: Solr, Multiple processes running

Posted by Walter Underwood <wu...@netflix.com>.
Since they all use the same schema, can you add a client ID to each document
when it is indexed? Filter by "clientid:4" and you get a subset of the
index.

wunder
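
A minimal sketch of that approach (names below are illustrative, not from this thread): declare clientid as an indexed string field in schema.xml, send it with every document, and then restrict searches with a filter query, e.g.

    http://localhost:8983/solr/select?q=some+query&fq=clientid:4

The fq filter is applied on top of the main query and should stay cheap to reuse across that client's searches.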

On 12/11/07 1:01 PM, "Owens, Martin" <Ma...@merrillcorp.com> wrote:

> Hello everyone,
> 
> The system we're moving from (dtSearch) allows each of our clients to have a
> search index. So far I have yet to find the options required to set this, it
> seems I can only set the directory path before run time.
> 
> Each of the indexes uses the same schema, same configuration just different
> data in each; what kind of performance penalty would I have from running a new
> solr instance per required database? what is the best way to track what port
> or what index is being used? would I be able to run 1,000 or more solr
> instances without performance degradation?
> 
> Thanks for your help.
> 
> Best regards, Martin Owens


Solr, Multiple processes running

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.
Hello everyone,

The system we're moving from (dtSearch) allows each of our clients to have a search index. So far I have yet to find the options required to set this; it seems I can only set the directory path before runtime.

Each of the indexes uses the same schema and configuration, just different data. What kind of performance penalty would I have from running a new Solr instance per required database? What is the best way to track which port or which index is being used? Would I be able to run 1,000 or more Solr instances without performance degradation?

Thanks for your help.

Best regards, Martin Owens

RE: Solr Highlighting, word index

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.
Hello Everyone,

Just to keep you all up to date about the madness I've created. I managed to get the data I wanted by hacking:

lucene-2.2.0 highlight/Highlighter.java
solr-1.2 util/HighlightingUtils.java

I got it to output either the word index or pairs of letter offsets (start and end) depending on what is required. That means I'm getting the data I want. I'm not very happy with the code, since it returns these numbers as strings and ends up loading and storing the entire file string, but beggars can't be choosers and I'm certainly no artisan at Java.

Best Regards, Martin Owens

-----Original Message-----
From: Binkley, Peter [mailto:Peter.Binkley@ualberta.ca]
Sent: Wed 12/5/2007 4:07 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Highlighting, word index
 
We're doing a similar process using term vectors to look up the
bounding-box data in a custom response writer for a specific project,
but we're trying to get this packaged up in a more generally usable way
along with handling paging: see
https://issues.apache.org/jira/browse/SOLR-380. We're looking at using
Lucene's new payload functionality.

Peter

RE: Solr Highlighting, word index

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
We're doing a similar process using term vectors to look up the
bounding-box data in a custom response writer for a specific project,
but we're trying to get this packaged up in a more generally usable way
along with handling paging: see
https://issues.apache.org/jira/browse/SOLR-380. We're looking at using
Lucene's new payload functionality.

Peter

-----Original Message-----
From: Mike Klaas [mailto:mike.klaas@gmail.com] 
Sent: Wednesday, December 05, 2007 2:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Highlighting, word index


On 5-Dec-07, at 1:02 PM, Owens, Martin wrote:

> Thanks Mike, So in essence I need to write a new RequestHandler plugin
> which takes the query string, tokenises it then perform a some kind of
> action against the index to return results which I should then be able
> to get the termVectors from?

Search results in a request handler consist of a sorted list of integer
lucene doc ids.  You'll have to query the lucene api for the term
vectors corresponding to those doc ids.

> Would not the termVectors already be available from the normal search 
> and we'd just be asking for the term vectors from that?

No, term vectors are stored separately from the inverted index and are
not used in search.

> Any advice for a perl/python programmer who is trying to baddly hack 
> this in Java?

Unfortunately, not really.  It's probably worth stepping back to learn
the basics of the lucene/solr apis before trying to accomplish your
specific goal.

-Mike


Re: Solr Highlighting, word index

Posted by Mike Klaas <mi...@gmail.com>.
On 5-Dec-07, at 1:02 PM, Owens, Martin wrote:

> Thanks Mike, So in essence I need to write a new RequestHandler  
> plugin which takes the query string, tokenises it then perform a  
> some kind of action against the index to return results which I  
> should then be able to get the termVectors from?

Search results in a request handler consist of a sorted list of  
integer lucene doc ids.  You'll have to query the lucene api for the  
term vectors corresponding to those doc ids.

> Would not the termVectors already be available from the normal  
> search and we'd just be asking for the term vectors from that?

No, term vectors are stored separately from the inverted index and  
are not used in search.

> Any advice for a perl/python programmer who is trying to baddly  
> hack this in Java?

Unfortunately, not really.  It's probably worth stepping back to learn  
the basics of the lucene/solr apis before trying to accomplish your  
specific goal.

-Mike
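
A rough, untested sketch of that step using Solr 1.2 / Lucene 2.2 era APIs (org.apache.solr.search.DocIterator, org.apache.lucene.index.TermFreqVector); "results" is whatever DocList the handler already produced, and "content" is an assumed field name:

    // inside a custom request handler, after the search has run
    SolrIndexSearcher searcher = req.getSearcher();
    IndexReader reader = searcher.getReader();
    DocIterator iter = results.iterator();
    while (iter.hasNext()) {
      int docId = iter.nextDoc();                          // internal Lucene doc id
      TermFreqVector tv = reader.getTermFreqVector(docId, "content");
      if (tv == null) continue;                            // no term vector stored for this doc
      String[] terms = tv.getTerms();                      // terms present in this doc's field
      int[] freqs = tv.getTermFrequencies();
      // add whatever subset of this you need to the response
    }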

RE: Solr Highlighting, word index

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.
> You do not necessarily need two requests; instead, you can override
> or modify the request handler you are using (StandardRequestHandler,
> DisMaxRequestHandler) to return the information.  You'll have to
> process the Query to extract the terms (like HighlightingUtils does),
> then get the TermVector token offset data for each matching doc and
> look for the terms in the Query.  I haven't worked with Term Vectors
> (a Lucene API), so I'm not sure exactly how to go about this.

Thanks Mike. So in essence I need to write a new RequestHandler plugin which takes the query string, tokenises it, then performs some kind of action against the index to return results which I should then be able to get the termVectors from?

Wouldn't the termVectors already be available from the normal search, so we'd just be asking for the term vectors from that?

Any advice for a Perl/Python programmer who is trying to badly hack this in Java?

Best Regards, Martin Owens

Re: Solr Highlighting, word index

Posted by Mike Klaas <mi...@gmail.com>.
On 3-Dec-07, at 10:58 AM, Owens, Martin wrote:

>
>
>> You can tell lucene to store token offsets using TermVectors
>> (configurable via schema.xml).  Then you can customize the request
>> handler to return the token offsets (and/or positions) by retrieving
>> the TVs.
>
> I think that is the best plan of action, how do I create a custom  
> request handler that will use the existing indexed fields? There  
> will be 2 requests as I see it, 1 for the search and 1 to retrieve  
> the offsets when you view one of those found items. Any advice you  
> can give me will be much appricated as I've had no luck with google  
> so far.

First, you need to store token offsets for the field:
See http://wiki.apache.org/solr/SchemaXml , "Expert field options".   
You definitely want termVectors=true, termOffsets=true.

You do not necessarily need two requests; instead, you can override  
or modify the request handler you are using (StandardRequestHandler,  
DisMaxREquestHandler) to return the information.  You'll have to  
process the Query to extract the terms (like HighlighingUtils does),  
then get the TermVector token offset data for each matching doc and  
look for the terms in the Query.  I haven't worked with Term Vectors  
(a Lucene API), so I'm not sure exactly how to go about this.

-Mike
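
A rough, untested sketch of that flow against the Lucene 2.2 term vector API; the "content" field name and the pre-extracted queryTerms array are assumptions, and the field must have been indexed with termVectors="true" termPositions="true" termOffsets="true" as described above:

    // docId is an internal Lucene doc id; reader comes from the handler's searcher
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, "content");
    for (int i = 0; i < queryTerms.length; i++) {
      int idx = tpv.indexOf(queryTerms[i]);
      if (idx < 0) continue;                               // term not in this document
      int[] positions = tpv.getTermPositions(idx);         // word indexes within the field
      TermVectorOffsetInfo[] offs = tpv.getOffsets(idx);   // null if offsets were not stored
      for (int j = 0; offs != null && j < offs.length; j++) {
        int start = offs[j].getStartOffset();              // character offsets into the OCR text
        int end = offs[j].getEndOffset();
        // return (positions[j], start, end) so the client can map it onto the image
      }
    }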

RE: Solr Highlighting, word index

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.

> You can tell lucene to store token offsets using TermVectors  
> (configurable via schema.xml).  Then you can customize the request  
> handler to return the token offsets (and/or positions) by retrieving  
> the TVs.

I think that is the best plan of action. How do I create a custom request handler that will use the existing indexed fields? There will be two requests as I see it: one for the search and one to retrieve the offsets when you view one of those found items. Any advice you can give me will be much appreciated, as I've had no luck with Google so far.

Thanks for your help so far,

Best Regards, Martin Owens


solr ubuntu and tomcat

Posted by Yousef Ourabi <yo...@zero-analog.com>.
Has anyone else had any trouble running Solr on Ubuntu with the apt-installed Tomcat (not a download from apache.org)?

I'm having a bear of a time.

On Debian Etch I managed to get Solr working by setting TOMCAT_SECURITY=no in /etc/default/tomcat5.5 

The same solr.xml (Context) on Ubuntu fails with a NoClassDefFoundError as if I had not set solr/home in the context -- however, both solr/home and the docBase are correct (i.e. they are there, owned by tomcat55.adm with liberal 777 permissions).

Any thoughts? Anyone else have a similar experience?

Thanks.
Yousef 
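
For comparison, a context fragment of the kind the Solr-on-Tomcat setup expects usually looks roughly like this (paths here are examples, not your actual layout):

    <!-- e.g. /etc/tomcat5.5/Catalina/localhost/solr.xml -->
    <Context docBase="/var/lib/solr/solr.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String"
                   value="/var/lib/solr/home" override="true"/>
    </Context>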

RE: DynamicField and FacetFields..

Posted by Jeryl Cook <tw...@hotmail.com>.
Fixed, I had a typo... you may want to delete my post (I want to :P).

Jeryl Cook  
> From: twoencore@hotmail.com
> To: solr-user@lucene.apache.org
> Subject: DynamicField and  FacetFields..
> Date: Sat, 1 Dec 2007 14:21:12 -0500
> 
> Question:
> I need to dynamically data to SOLR, so I do not have a "predefined" list of field names...
> 
> so i use the dynamicField option in the schma and match approioate datatype..
> in my schema.xml
> <field name="id" type="string" indexed="true" stored="true" required="true" />
> 
> <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
> 
> 
> 
> Then programatically my code
> ...
> document.addField( dynamicFieldName + "_s",  dynamicFieldValue, 10 ); 
> facetFieldNames.put( dynamicFieldName + "_s",null);//TODO:use copyField..
> server.add( document,true );
> server.commit();
> 
> when i attempt to graph results, i want to display 
>         SolrQuery query = new SolrQuery();
>         query.setQuery( "*:*" );
>         query.setFacetLimit(10);//TODO:
>         Iterator facetsIt = facetFieldNames.entrySet().iterator();
>         while(facetsIt.hasNext()){
>             Entry<String,String>entry = (Entry)facetsIt.next();
>             String facetName = (String)entry.getKey();
>             query.addFacetField(facetName);
>         }
>      
>         QueryResponse rsp;
>        
>             rsp = server.query( query );
>            List<FacetField> facetFieldList = rsp.getFacetFields();         
>            assertNotNull(facetFieldList);
> 
>    ....
> 
> 
> my facetFieldList is null, of course if i addFacetField if "id" it works..because i define it in the schema.xml........
> 
> is this just a something that is not implemented? or am i missing something...
> 
> Thanks.
> 
> 
> 
> Jeryl Cook 
> 
> 
> 
> /^\ Pharaoh /^\ 
> 
> http://pharaohofkush.blogspot.com/ 
> 
> 
> 
> "..Act your age, and not your shoe size.."
> 
> -Prince(1986)
> 
> > Date: Fri, 30 Nov 2007 21:23:59 -0500
> > From: erickerickson@gmail.com
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr Highlighting, word index
> > 
> > It's good you already have the data because if you somehow got it from
> > some sort of calculations I'd have to tell my product manager that
> > the feature he wanted that I told him couldn't be done with our data
> > was possible after all <G>...
> > 
> > About page breaks:
> > 
> > Another approach to paging is to index a special page token with an
> > increment of 0 from the last word of the page. Say you have the following:
> > last ctrl-l first. Then index last, $$$$$$$ at an increment of 0 then first.
> > 
> > You can then quite quickly calculate the pages by using
> > termdocs/termenum on your special token and count.
> > 
> > Which approach you use depends upon whether you want span and/or
> > phrase queries to match across page boundaries. If you use an increment as
> > Mike suggests, matching "last first"~3 won't work. It just depends upon
> > whether how you want to match across the page break.
> > 
> > Best
> > Erick
> > 
> > On Nov 30, 2007 4:37 PM, Mike Klaas <mi...@gmail.com> wrote:
> > 
> > > On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
> > >
> > > >
> > > > Hello everyone,
> > > >
> > > > We're working to replace the old Linux version of dtSearch with
> > > > Lucene/Solr, using the http requests for our perl side and java for
> > > > the indexing.
> > > >
> > > > The functionality that is causing the most problems is the
> > > > highlighting since we're not storing the text in solr (only
> > > > indexing) and we need to highlight an image file (ocr) so what we
> > > > really need is to request from solr the word indexes of the
> > > > matches, we then tie this up to the ocr image and create html boxes
> > > > to do the highlighting.
> > >
> > > This isn't possible with Solr out-of-the-box.  Also, the usual
> > > methods for highlighting won't work because Solr typically re-
> > > analyzes the raw text to find the appropriate highlighting points.
> > > However, it shouldn't be too hard to come up with a custom solution.
> > > You can tell lucene to store token offsets using TermVectors
> > > (configurable via schema.xml).  Then you can customize the request
> > > handler to return the token offsets (and/or positions) by retrieving
> > > the TVs.
> > >
> > > > The text is also multi page, each page is seperated by Ctrl-L page
> > > > breaks, should we handle the paging out selves or can Solr tell use
> > > > which page the match happened on too?
> > >
> > > Again, not automatically.  However, if you wrote an analyzer that
> > > bumped up the position increment of tokens every time a new page was
> > > found (to, say the next multiple of 1000), then you infer the
> > > matching page by the token position.
> > >
> > > cheers,
> > > -Mike
> > >

DynamicField and FacetFields..

Posted by Jeryl Cook <tw...@hotmail.com>.
Question:
I need to add data to Solr dynamically, so I do not have a "predefined" list of field names...

So I use the dynamicField option in the schema and match the appropriate datatype
in my schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" />

<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>



Then programmatically my code:
...
document.addField( dynamicFieldName + "_s",  dynamicFieldValue, 10 ); 
facetFieldNames.put( dynamicFieldName + "_s",null);//TODO:use copyField..
server.add( document,true );
server.commit();

When I attempt to graph results, I want to display:
        SolrQuery query = new SolrQuery();
        query.setQuery( "*:*" );
        query.setFacetLimit(10);//TODO:
        Iterator facetsIt = facetFieldNames.entrySet().iterator();
        while(facetsIt.hasNext()){
            Entry<String,String>entry = (Entry)facetsIt.next();
            String facetName = (String)entry.getKey();
            query.addFacetField(facetName);
        }
     
        QueryResponse rsp;
       
            rsp = server.query( query );
           List<FacetField> facetFieldList = rsp.getFacetFields();         
           assertNotNull(facetFieldList);

   ....


My facetFieldList is null. Of course, if I addFacetField with "id" it works, because I define it in the schema.xml.

Is this just something that is not implemented, or am I missing something?

Thanks.
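
For the archives, a bare-bones SolrJ sketch of faceting on a dynamic *_s field (the color_s field name is made up, and this assumes the client has the usual facet helpers and that faceting is switched on explicitly):

    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);                        // make sure faceting is actually enabled
    query.setFacetLimit(10);
    query.addFacetField("color_s");              // a dynamic *_s field that was really indexed
    QueryResponse rsp = server.query(query);
    FacetField colors = rsp.getFacetField("color_s");
    for (FacetField.Count c : colors.getValues()) {
      System.out.println(c.getName() + " -> " + c.getCount());
    }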



Jeryl Cook 



/^\ Pharaoh /^\ 

http://pharaohofkush.blogspot.com/ 



"..Act your age, and not your shoe size.."

-Prince(1986)

> Date: Fri, 30 Nov 2007 21:23:59 -0500
> From: erickerickson@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Highlighting, word index
> 
> It's good you already have the data because if you somehow got it from
> some sort of calculations I'd have to tell my product manager that
> the feature he wanted that I told him couldn't be done with our data
> was possible after all <G>...
> 
> About page breaks:
> 
> Another approach to paging is to index a special page token with an
> increment of 0 from the last word of the page. Say you have the following:
> last ctrl-l first. Then index last, $$$$$$$ at an increment of 0 then first.
> 
> You can then quite quickly calculate the pages by using
> termdocs/termenum on your special token and count.
> 
> Which approach you use depends upon whether you want span and/or
> phrase queries to match across page boundaries. If you use an increment as
> Mike suggests, matching "last first"~3 won't work. It just depends upon
> whether how you want to match across the page break.
> 
> Best
> Erick
> 
> On Nov 30, 2007 4:37 PM, Mike Klaas <mi...@gmail.com> wrote:
> 
> > On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
> >
> > >
> > > Hello everyone,
> > >
> > > We're working to replace the old Linux version of dtSearch with
> > > Lucene/Solr, using the http requests for our perl side and java for
> > > the indexing.
> > >
> > > The functionality that is causing the most problems is the
> > > highlighting since we're not storing the text in solr (only
> > > indexing) and we need to highlight an image file (ocr) so what we
> > > really need is to request from solr the word indexes of the
> > > matches, we then tie this up to the ocr image and create html boxes
> > > to do the highlighting.
> >
> > This isn't possible with Solr out-of-the-box.  Also, the usual
> > methods for highlighting won't work because Solr typically re-
> > analyzes the raw text to find the appropriate highlighting points.
> > However, it shouldn't be too hard to come up with a custom solution.
> > You can tell lucene to store token offsets using TermVectors
> > (configurable via schema.xml).  Then you can customize the request
> > handler to return the token offsets (and/or positions) by retrieving
> > the TVs.
> >
> > > The text is also multi page, each page is seperated by Ctrl-L page
> > > breaks, should we handle the paging out selves or can Solr tell use
> > > which page the match happened on too?
> >
> > Again, not automatically.  However, if you wrote an analyzer that
> > bumped up the position increment of tokens every time a new page was
> > found (to, say the next multiple of 1000), then you infer the
> > matching page by the token position.
> >
> > cheers,
> > -Mike
> >

Re: Solr result offsets

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 5, 2007 3:06 PM, Owens, Martin <Ma...@merrillcorp.com> wrote:
> surely the term offsets are returned when a search is done on a field with that data available?

Nope.
That data isn't even stored in the index unless you store termvectors
with that info... and even in that case the info is more like a stored
field (not accessed, utilized, or available during a search).

-Yonik

Solr result offsets

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.
Hello again,

So I've been concentrating on hacking the Util/Highlighting.java to see if I could get it to output the result offsets I need to do the highlighting. It turns out that this method requires that the field be stored as well as indexed. I would like to be able to just set termOffsets and termPositions and have that data returned to me when I do a specific kind of search.

I ended up getting very confused about the Request Handler plugin that I'm sure will be the solution in the end; it just seems to want the search to be performed again for no good reason. Surely the term offsets are returned when a search is done on a field with that data available?

Best Regards, Martin Owens


Re: Solr Highlighting, word index

Posted by Erick Erickson <er...@gmail.com>.
It's good you already have the data because if you somehow got it from
some sort of calculations I'd have to tell my product manager that
the feature he wanted that I told him couldn't be done with our data
was possible after all <G>...

About page breaks:

Another approach to paging is to index a special page token with an
increment of 0 from the last word of the page. Say you have the following:
"last <Ctrl-L> first". Then index "last", then "$$$$$$$" at an increment of 0, then "first".

You can then quite quickly calculate the pages by using
termdocs/termenum on your special token and count.

Which approach you use depends upon whether you want span and/or
phrase queries to match across page boundaries. If you use an increment as
Mike suggests, matching "last first"~3 won't work. It just depends upon
how you want to match across the page break.

Best
Erick
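
A rough, untested sketch of the counting side of that idea (Lucene 2.2 API; the "content" field name and the $$$$$$$ sentinel are just the ones from the example above): walk the page token's positions in a document and count how many fall before the position of the hit.

    // matchPosition is the token position of the hit in this document
    TermPositions tp = reader.termPositions(new Term("content", "$$$$$$$"));
    int page = 1;                                    // 1-based page number
    if (tp.skipTo(docId) && tp.doc() == docId) {
      int breaks = tp.freq();                        // page breaks in this document
      for (int i = 0; i < breaks; i++) {
        if (tp.nextPosition() < matchPosition) {
          page++;                                    // one more boundary lies before the hit
        }
      }
    }
    tp.close();
    // "page" now says which page the hit falls on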

On Nov 30, 2007 4:37 PM, Mike Klaas <mi...@gmail.com> wrote:

> On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
>
> >
> > Hello everyone,
> >
> > We're working to replace the old Linux version of dtSearch with
> > Lucene/Solr, using the http requests for our perl side and java for
> > the indexing.
> >
> > The functionality that is causing the most problems is the
> > highlighting since we're not storing the text in solr (only
> > indexing) and we need to highlight an image file (ocr) so what we
> > really need is to request from solr the word indexes of the
> > matches, we then tie this up to the ocr image and create html boxes
> > to do the highlighting.
>
> This isn't possible with Solr out-of-the-box.  Also, the usual
> methods for highlighting won't work because Solr typically re-
> analyzes the raw text to find the appropriate highlighting points.
> However, it shouldn't be too hard to come up with a custom solution.
> You can tell lucene to store token offsets using TermVectors
> (configurable via schema.xml).  Then you can customize the request
> handler to return the token offsets (and/or positions) by retrieving
> the TVs.
>
> > The text is also multi page, each page is seperated by Ctrl-L page
> > breaks, should we handle the paging out selves or can Solr tell use
> > which page the match happened on too?
>
> Again, not automatically.  However, if you wrote an analyzer that
> bumped up the position increment of tokens every time a new page was
> found (to, say the next multiple of 1000), then you infer the
> matching page by the token position.
>
> cheers,
> -Mike
>

Re: Solr Highlighting, word index

Posted by Mike Klaas <mi...@gmail.com>.
On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:

>
> Hello everyone,
>
> We're working to replace the old Linux version of dtSearch with  
> Lucene/Solr, using the http requests for our perl side and java for  
> the indexing.
>
> The functionality that is causing the most problems is the  
> highlighting since we're not storing the text in solr (only  
> indexing) and we need to highlight an image file (ocr) so what we  
> really need is to request from solr the word indexes of the  
> matches, we then tie this up to the ocr image and create html boxes  
> to do the highlighting.

This isn't possible with Solr out-of-the-box.  Also, the usual  
methods for highlighting won't work because Solr typically re- 
analyzes the raw text to find the appropriate highlighting points.   
However, it shouldn't be too hard to come up with a custom solution.   
You can tell lucene to store token offsets using TermVectors  
(configurable via schema.xml).  Then you can customize the request  
handler to return the token offsets (and/or positions) by retrieving  
the TVs.

> The text is also multi page, each page is seperated by Ctrl-L page  
> breaks, should we handle the paging out selves or can Solr tell use  
> which page the match happened on too?

Again, not automatically.  However, if you wrote an analyzer that  
bumped up the position increment of tokens every time a new page was  
found (to, say, the next multiple of 1000), then you could infer the  
matching page from the token position.

cheers,
-Mike
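
For reference, a rough, untested sketch of such a filter against the Lucene 2.2 TokenStream API. It assumes something upstream turns each Ctrl-L into a literal "\f" sentinel token, which the filter swallows while bumping the next real token to the next multiple of 1000:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class PageBreakPositionFilter extends TokenFilter {
      private static final int PAGE_STRIDE = 1000;
      private int position = -1;            // absolute position of the last token emitted
      private boolean sawPageBreak = false;

      public PageBreakPositionFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        Token t;
        while ((t = input.next()) != null) {
          if ("\f".equals(t.termText())) {  // sentinel page-break token: drop it
            sawPageBreak = true;
            continue;
          }
          if (sawPageBreak) {
            // jump so that (0-based) page number = position / 1000
            int target = ((position / PAGE_STRIDE) + 1) * PAGE_STRIDE;
            t.setPositionIncrement(target - position);
            sawPageBreak = false;
          }
          position += t.getPositionIncrement();
          return t;
        }
        return null;
      }
    }

The page of a hit is then its token position divided by 1000, at the cost of phrase and span matches no longer crossing a page boundary, as Erick notes above.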

RE: Solr Highlighting, word index

Posted by "Owens, Martin" <Ma...@merrillcorp.com>.
> Or I'm just completely off base here.....

A little; we already have the locations for each word on every OCR image, we just need the word index to feed into the existing program.

Best Regards, Martin Owens

Re: Solr Highlighting, word index

Posted by Erick Erickson <er...@gmail.com>.
Oh, good luck on this! I've had similar issues and have just thrown up my
hands. How do you expect to be able to correlate a word in the index
with the bounding box in the OCR? I'm not sure this is a solved problem
unless your OCR is *very* regular and clean. Even if you can calculate
the ordinal position of the word, you'd be hosed if the OCR image was,
say, slightly tilted at scan time.

Or do you have information about where on the page every word is that
you somehow store and retrieve for highlighting purposes?

Because I don't think this is a problem that Lucene/Solr can solve, unless
I just completely fail to understand things. Which wouldn't be the first
time...

Unless you have data telling you where each word appears in the original OCR,
I don't know how you'd go from knowing that a word appeared on a page to
being able to calculate its bounding box. And if you *do* have this info, you
don't need Lucene/Solr to know what to highlight; all you need to know is
which pages which words appear on. Which is a non-trivial thing to get back
from Lucene/Solr, but at least that's do-able.

Or I'm just completely off base here.....

Best
Erick




On Nov 30, 2007 4:02 PM, Owens, Martin <Ma...@merrillcorp.com> wrote:

>
> Hello everyone,
>
> We're working to replace the old Linux version of dtSearch with
> Lucene/Solr, using the http requests for our perl side and java for the
> indexing.
>
> The functionality that is causing the most problems is the highlighting
> since we're not storing the text in solr (only indexing) and we need to
> highlight an image file (ocr) so what we really need is to request from solr
> the word indexes of the matches, we then tie this up to the ocr image and
> create html boxes to do the highlighting.
>
> The text is also multi page, each page is seperated by Ctrl-L page breaks,
> should we handle the paging out selves or can Solr tell use which page the
> match happened on too?
>
> Thanks for your help,
>
> Best Regards, Martin Owens
>

Re: Solr Highlighting, word index

Posted by Ryan McKinley <ry...@gmail.com>.
Owens, Martin wrote:
> Hello everyone,
> 
> We're working to replace the old Linux version of dtSearch with Lucene/Solr, using the http requests for our perl side and java for the indexing.
> 
> The functionality that is causing the most problems is the highlighting since we're not storing the text in solr (only indexing) and we need to highlight an image file (ocr) so what we really need is to request from solr the word indexes of the matches, we then tie this up to the ocr image and create html boxes to do the highlighting.
> 

Sorry this hasn't had a response....

I'm not totally following what you are trying to do.  If I understand 
it, you want to use solr to get back the matching highlighting areas 
from an arbitrary bit of text that is not stored in the index?

Offhand, there is nothing out of the box to do this (Mike?).

My guess is you will have to write a custom requestHandler that pulls 
the stored text from wherever you store it, then passes it to a custom 
Formatter that includes the offsets in the response.

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/src/java/org/apache/lucene/search/highlight/Formatter.java

In 1.3-dev (/trunk) you can register a custom Formatter in solrconfig.xml


ryan
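
A rough, untested sketch of what such a Formatter could look like against the Lucene 2.2 highlighter API (the class name is made up): instead of wrapping matches in markup it records their character offsets, which a custom response writer could then emit.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.highlight.Formatter;
    import org.apache.lucene.search.highlight.TokenGroup;

    public class OffsetCollectingFormatter implements Formatter {
      private final List offsets = new ArrayList();   // int[2] {start, end} pairs

      public String highlightTerm(String originalText, TokenGroup tokenGroup) {
        if (tokenGroup.getTotalScore() > 0) {          // this group matched the query
          offsets.add(new int[] { tokenGroup.getStartOffset(),
                                  tokenGroup.getEndOffset() });
        }
        return originalText;                           // leave the fragment text unchanged
      }

      public List getOffsets() {
        return offsets;
      }
    }

Roughly, you'd hand an instance of this to the Highlighter in place of SimpleHTMLFormatter, run the fragmenting as usual, and then read the collected offsets back out for the response.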