Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2011/06/20 14:41:36 UTC

Extending Solr Highlighter to pull information from external source

I am trying to index data, but I'm concerned that storing the contents of a
specific field will be a bit of a hog, so we are planning to retrieve this
information from an external source as needed for highlighting.  I am
looking to extend the default Solr highlighting capability to work with
information pulled from this external source.  It looks like this is
possible by extending DefaultSolrHighlighter (line 418, to pull a particular
field from the external source) for standard highlighting and
BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just
hard-code this so that if the field name is a specific value it looks in the
external source.  Is this the best way to accomplish this?  Are there any
other extension points to do what I'm suggesting?

Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

I haven't seen any interest in this, but for anyone following, I
updated the alternateField logic to support pulling from the external
field if available.  It would be useful to know how to get Solr to use
this external field provider in general, so we wouldn't have to modify
the highlighter at all, just whatever was building the document.


Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

I tried the patch at SOLR-1397 but it didn't work as I'd expect.

<lst name="highlighting">
    <lst name="1">
        <arr name="subject_phonetic">
            <str><em>Test</em> subject message</str>
        </arr>
        <arr name="subject_phonetic_startPos"><int>0</int></arr>
        <arr name="subject_phonetic_endPos"><int>29</int></arr>
    </lst>
</lst>
The start position is right, but the end position seems to be the
length of the field.
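
A quick way to see where the 29 might come from is to compare a few candidate lengths. This is only a sketch: the raw field value `Test subject message` is inferred from the fragment above and is an assumption, and the class and method names are made up for illustration.

```java
// Sketch only: compares the reported endPos with candidate string lengths.
// RAW is an assumed field value inferred from the highlighted fragment.
class EndPosCheck {
    static final String RAW = "Test subject message";               // assumed raw field value
    static final String FRAGMENT = "<em>Test</em> subject message"; // fragment from the response

    // Offset just past the first occurrence of term in raw, or -1 if absent.
    static int expectedEndOfMatch(String raw, String term) {
        int start = raw.indexOf(term);
        return start < 0 ? -1 : start + term.length();
    }

    public static void main(String[] args) {
        System.out.println("raw length      = " + RAW.length());
        System.out.println("fragment length = " + FRAGMENT.length());
        System.out.println("end of match    = " + expectedEndOfMatch(RAW, "Test"));
    }
}
```

Under that assumption the raw text is 20 characters long, the match on "Test" ends at offset 4, and the fragment with its <em> tags is 29 characters, so the reported endPos lines up with the length of the tagged fragment.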



Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

I added the highlighting code I am using to this JIRA
(https://issues.apache.org/jira/browse/SOLR-1397).  Afterwards I
noticed this JIRA (https://issues.apache.org/jira/browse/SOLR-1954),
which talks about another solution.  I think David's patch would have
worked equally well for my problem; it would just require doing the
highlighting later on the client's end.  I'll have to give this a whirl
over the weekend.


Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

Boy it's been a long time since I first wrote this, sorry for the delay....

I think I have this working as I expect with a test implementation.  I
created the following interface

public interface SolrExternalFieldProvider extends NamedListInitializedPlugin {
    public String[] getFieldContent(String key, SchemaField field,
                                    SolrQueryRequest request);
}

I then added to DefaultSolrHighlighter the following:

in init()

SolrExternalFieldProvider defaultProvider = solrCore.initPlugins(
        info.getChildren("externalFieldProvider"),
        externalFieldProviders, SolrExternalFieldProvider.class, null);
if (defaultProvider != null) {
    externalFieldProviders.put("", defaultProvider);
    externalFieldProviders.put(null, defaultProvider);
}

Then in doHighlightByHighlighter I added the following:

if (schemaField != null && !schemaField.stored()) {
    // Field is not stored, so try the external provider.  Default to an
    // empty array so docTexts is always assigned, even without a usable key.
    docTexts = new String[]{};
    SolrExternalFieldProvider externalFieldProvider =
            this.getExternalFieldProvider(fieldName, params);
    if (externalFieldProvider != null) {
        SchemaField keyField = schema.getUniqueKeyField();
        // I know this field exists and is not multivalued
        String key = doc.getValues(keyField.getName())[0];
        if (key != null && key.length() > 0) {
            docTexts = externalFieldProvider.getFieldContent(key,
                    schemaField, req);
        }
    }
} else {
    docTexts = doc.getValues(fieldName);
}


This worked for me.  I needed to include the req because there are
some additional things that I need from it; I figure this is
probably something other folks will need as well.  I tried to follow
the pattern used for the other highlighter pieces, in that you can have
different externalFieldProviders for each field.  I'm more than happy
to share the actual classes with the community or add them to one of
the JIRA issues mentioned below.  I haven't done so yet because I don't
know how to build patches.
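
For anyone who wants to try the idea without a Solr checkout, here is a minimal, self-contained sketch of the same flow.  It is not the code from the patch: `ExternalFieldProvider` is a simplified stand-in for the interface above (plain strings instead of SchemaField and SolrQueryRequest), `MapBackedProvider` is a hypothetical in-memory provider, and the per-field lookup with a fallback to the default registered under "" is an assumption about how getExternalFieldProvider behaves.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for SolrExternalFieldProvider so this compiles without Solr.
interface ExternalFieldProvider {
    String[] getFieldContent(String key, String fieldName);
}

// Hypothetical provider backed by an in-memory map keyed by "docKey/fieldName".
class MapBackedProvider implements ExternalFieldProvider {
    private final Map<String, String[]> store = new HashMap<>();

    MapBackedProvider put(String key, String fieldName, String... values) {
        store.put(key + "/" + fieldName, values);
        return this;
    }

    @Override
    public String[] getFieldContent(String key, String fieldName) {
        String[] values = store.get(key + "/" + fieldName);
        return values != null ? values : new String[0];
    }
}

class ExternalFieldDemo {
    // Assumed lookup rule: a field-specific provider wins, else the default under "".
    static ExternalFieldProvider select(Map<String, ExternalFieldProvider> providers,
                                        String fieldName) {
        ExternalFieldProvider p = providers.get(fieldName);
        return p != null ? p : providers.get("");
    }

    // Stored fields use their stored values; unstored fields go to the provider.
    static String[] textsFor(boolean stored, String[] storedValues, String key,
                             String fieldName,
                             Map<String, ExternalFieldProvider> providers) {
        if (stored) {
            return storedValues;
        }
        ExternalFieldProvider provider = select(providers, fieldName);
        if (provider == null || key == null || key.isEmpty()) {
            return new String[0];
        }
        return provider.getFieldContent(key, fieldName);
    }

    public static void main(String[] args) {
        Map<String, ExternalFieldProvider> providers = new HashMap<>();
        providers.put("", new MapBackedProvider().put("1", "body", "Test subject message"));
        String[] texts = textsFor(false, null, "1", "body", providers);
        System.out.println(texts.length + " value(s): " + String.join(" | ", texts));
    }
}
```

Returning an empty array rather than null when nothing is found keeps the caller's loop over docTexts free of null checks, which matches the `new String[]{}` fallback in the snippet above.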


Re: Extending Solr Highlighter to pull information from external source

Posted by Michael Sokolov <so...@ifactory.com>.

I found https://issues.apache.org/jira/browse/SOLR-1397 but there is not
much going on there.

LUCENE-1522 <https://issues.apache.org/jira/browse/LUCENE-1522> has a lot
of fascinating discussion on this topic, though.


> There is a couple of long lived issues in jira for this (I'd like to 
> try to search
> them, but I couldn't access jira now).
>
> For FVH, it is needed to be modified at Lucene level to use external 
> data.
>
> koji

Koji - is that really so?  It appears to me that one could extend
BaseFragmentsBuilder and override

createFragments(IndexReader reader, int docId,
       String fieldName, FieldFragList fieldFragList, int maxNumFragments,
       String[] preTags, String[] postTags, Encoder encoder )

providing a version that retrieves text from some external source rather 
than from Lucene fields.

It sounds to me like a really useful modification in Lucene core would 
be to retain match points that have already been computed during scoring 
so the highlighter doesn't have to attempt to reinvent all that logic!  
This has all been discussed at length in LUCENE-1522 already, but is
there any recent activity?

My hope is that since (at least in my test) search code seems to spend 
80% of its time highlighting, folks will take up this banner and do the 
plumbing needed to improve it - should lead to huge speed-ups for 
searching!  I'm continuing to read, but not really capable of making a 
meaningful contribution at this point.

-Mike
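
To make the createFragments idea a bit more concrete: once the text comes from somewhere else, the remaining work is applying match offsets to it.  The sketch below is not the Lucene API; `OffsetHighlighter` and `Span` are made-up names, and it assumes the offsets are already known, sorted, and non-overlapping.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: wraps known match offsets in tags, using text from any source.
class OffsetHighlighter {
    // A match as [start, end) character offsets into the field text.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Assumes spans are sorted by start and do not overlap.
    static String highlight(String text, List<Span> spans, String preTag, String postTag) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Span s : spans) {
            out.append(text, pos, s.start)   // untouched text before the match
               .append(preTag)
               .append(text, s.start, s.end) // the matched term itself
               .append(postTag);
            pos = s.end;
        }
        out.append(text, pos, text.length()); // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String external = "Test subject message"; // pretend this came from the external store
        System.out.println(highlight(external, Arrays.asList(new Span(0, 4)), "<em>", "</em>"));
    }
}
```

The point of the sketch is that the offsets only make sense if the external text is identical to what was analyzed at index time; any drift between the two would shift the tags.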

Re: Extending Solr Highlighter to pull information from external source

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

(11/06/20 21:41), Jamie Johnson wrote:
> I am trying to index data where I'm concerned that storing the contents of a
> specific field will be a bit of a hog so we are planning to retrieve this
> information as needed for highlighting from an external source.  I am
> looking to extend the default solr highlighting capability to work with
> information pulled from this external source and it looks like this is
> possible by extending DefaultSolrHighlighter (line 418 to pull a particular
> field from external source) for standard highlighting and
> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just hard
> code this to say if the field name is a specific value look into the
> external source, is this the best way to accomplish this?  Are there any
> other extension points to do what I'm suggesting?
>

There are a couple of long-lived issues in JIRA for this (I'd like to search
for them, but I can't access JIRA right now).

For FVH, changes would be needed at the Lucene level to use external data.

koji
-- 
http://www.rondhuit.com/en/

Re: Extending Solr Highlighter to pull information from external source

Posted by Mike Sokolov <so...@ifactory.com>.

Yes that sounds about right.  I also have in mind an optimization for 
highlighting so it doesn't need to pull the whole field value.  The fast 
vector highlighter is working with offsets into the field, and should 
work better w/random access into the field value(s).  But that should 
come as a later optimization.

Another thing that bugs me about FVH is that it seems to need to 
recompute all the terms that matched the query for each retrieved field 
value when it seems like it ought to be able to make use of information 
gleaned during the actual query process, but that probably involves some 
deep change to cache that info during query scoring, and that is beyond 
my ken at the moment.

-Mike
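
A rough sketch of what random access into an external field value could look like, assuming the value lives in a file and is stored in a single-byte encoding so that character offsets equal byte offsets (with UTF-8 that would not hold in general).  `RangeReader` is a made-up name, not part of Solr or Lucene.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: reads just the requested window of an externally stored value.
class RangeReader {
    // Returns the bytes in [start, end), clamped to the file, decoded as ISO-8859-1.
    static String readRange(Path file, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            long from = Math.max(0L, Math.min(start, length));
            long to = Math.max(from, Math.min(end, length));
            byte[] buffer = new byte[(int) (to - from)];
            raf.seek(from);        // jump straight to the window
            raf.readFully(buffer); // read only the bytes that are needed
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("external-field", ".txt");
        try {
            Files.write(file, "Test subject message".getBytes(StandardCharsets.ISO_8859_1));
            System.out.println(readRange(file, 5, 12));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With offsets from the highlighter, a caller would ask for the match plus some margin on each side instead of loading the whole value.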

On 06/20/2011 10:00 AM, Jamie Johnson wrote:
> perhaps it should be an array that gets returned to be consistent with 
> getValues(fieldName);

Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

Perhaps it should return an array, to be consistent with
getValues(fieldName).

On Mon, Jun 20, 2011 at 9:59 AM, Jamie Johnson <je...@gmail.com> wrote:

> Yes, in that case the code becomes
>
>         if(!schemaField.stored()){
>
>
>             SchemaField keyField = schema.getUniqueKeyField();
>             String key = doc.getValues(keyField.getName())[0];
>             docTexts = doc.getValues(fieldName);
>
>             if(key != null && key.length() > 0){
>                 for(int x = 0; x < docTexts.length; x++){
>                     docTexts[x] = docTexts[x] + " some added text";
>                 }
>             }
>         }
>
>
> I'd imagine that we'd want some type of interface to actually pull the text
> so you can plug in different providers, something like
>
> public interface ISolrExternalFieldProvider {
>       public String getFieldContent(String key, SchemaField field);
> }
>
> not sure if there is anything else that interface should include but that's
> all I would need at present.
>
>
>
> On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov <so...@ifactory.com> wrote:
>
>> Another option for determining whether to go to external storage would be
>> to examine the SchemaField, see if it is stored, and if not, try to fetch
>> from a file or whatever.  That way you won't have to configure anything.
>>
>> -Mike
>>
>>
>> On 06/20/2011 09:46 AM, Jamie Johnson wrote:
>>
>> In my case chucking the external storage is simply not an option.  I'll
>> definitely share anything I find.  The following is a very simple example of
>> adding text to the default Solr highlighter (I had to copy a large portion of
>> the class, along with some other classes, to get this to run, since the
>> method that actually does the highlighting is private).  If you look at the
>> source it should hopefully make sense.
>>
>>
>>         String[] docTexts = null;
>>
>>         if(fieldName.equals("title")){
>>
>>             SchemaField keyField = schema.getUniqueKeyField();
>>             // I know this field exists and is not multivalued
>>             String key = doc.getValues(keyField.getName())[0];
>>             // this would be loaded from external store, but below just
>>             // appends some information
>>             docTexts = doc.getValues(fieldName);
>>             if(key != null && key.length() > 0){
>>                 for(int x = 0; x < docTexts.length; x++){
>>                     docTexts[x] = docTexts[x] + " some added text";
>>                 }
>>             }
>>         }
>>
>> I have cheated since I know the name of the field (title) that I am
>> doing this for, but it would probably be useful to allow this to be set on
>> the highlighter class through configuration in solrconfig (I'm not familiar
>> at all with doing this and have spent 0 time looking into it).  Once
>> configured, the if(fieldName.equals("title")) line would be replaced with
>> something like if(externalFields.contains(fieldName)){...}.
>>
>> Thoughts/comments?
>>
>> On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <so...@ifactory.com> wrote:
>>
>>> I'd be very interested in this, as well, if you do it before me and are
>>> willing to share...
>>>
>>> A related question I have tried to ask on this list, and have never
>>> really gotten a good answer to, is whether it makes sense to just chuck the
>>> external storage and treat the lucene index as the primary storage for
>>> documents.  I have a feeling the answer is no; perhaps because of increased
>>> I/O costs for lucene and solr, but I don't really know.  I've been
>>> considering doing some experimentation, but would really love an expert
>>> opinion...
>>>
>>> -Mike
>>>
>>>
>>> On 06/20/2011 08:41 AM, Jamie Johnson wrote:
>>>
>>>> I am trying to index data where I'm concerned that storing the contents
>>>> of a
>>>> specific field will be a bit of a hog so we are planning to retrieve
>>>> this
>>>> information as needed for highlighting from an external source.  I am
>>>> looking to extend the default solr highlighting capability to work with
>>>> information pulled from this external source and it looks like this is
>>>> possible by extending DefaultSolrHighlighter (line 418 to pull a
>>>> particular
>>>> field from external source) for standard highlighting and
>>>> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just
>>>> hard
>>>> code this to say if the field name is a specific value look into the
>>>> external source, is this the best way to accomplish this?  Are there any
>>>> other extension points to do what I'm suggesting?
>>>>
>>>>
>>>>
>>>
>>
>

Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

Yes, in that case the code becomes

        if(!schemaField.stored()){

            SchemaField keyField = schema.getUniqueKeyField();
            String key = doc.getValues(keyField.getName())[0];
            docTexts = doc.getValues(fieldName);
            if(key != null && key.length() > 0){
                for(int x = 0; x < docTexts.length; x++){
                    docTexts[x] = docTexts[x] + " some added text";
                }
            }
        }


I'd imagine that we'd want some type of interface to actually pull the text
so you can plug in different providers, something like

public interface ISolrExternalFieldProvider {
      String getFieldContent(String key, SchemaField field);
}

Not sure if there is anything else that interface should include, but that's
all I would need at present.
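
To make the provider idea concrete, here is a minimal sketch of one possible shape for it. Everything in it is an assumption for illustration: the names ExternalFieldProvider and MapFieldProvider are hypothetical, the lookup takes a plain field name rather than a Solr SchemaField so that it compiles without Solr on the classpath, and the in-memory map only stands in for whatever file or database the real text would come from.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical provider contract. The version proposed in this thread takes a
// Solr SchemaField; a plain field name is used here to keep the sketch standalone.
interface ExternalFieldProvider {
    // Returns the external text for the given document key and field, or null if none.
    String getFieldContent(String key, String fieldName);
}

// Hypothetical in-memory provider, standing in for a file or database lookup.
class MapFieldProvider implements ExternalFieldProvider {
    private final Map<String, String> store = new HashMap<>();

    // Registers content for one (document key, field name) pair.
    void put(String key, String fieldName, String content) {
        store.put(key + "/" + fieldName, content);
    }

    @Override
    public String getFieldContent(String key, String fieldName) {
        return store.get(key + "/" + fieldName);
    }
}

class ProviderDemo {
    public static void main(String[] args) {
        MapFieldProvider provider = new MapFieldProvider();
        provider.put("1", "title", "Test subject message");
        System.out.println(provider.getFieldContent("1", "title"));
    }
}
```

Returning null for a missing entry keeps the contract small; a highlighter using it would need to treat null as "nothing to highlight" for that field.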


On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov <so...@ifactory.com> wrote:

> Another option for determining whether to go to external storage would be
> to examine the SchemaField, see if it is stored, and if not, try to fetch
> from a file or whatever.  That way you won't have to configure anything.
>
> -Mike
>
>
> On 06/20/2011 09:46 AM, Jamie Johnson wrote:
>
> In my case chucking the external storage is simply not an option.  I'll
> definitely share anything I find,  the following is a very simple example of
> adding text to the default solr highlighter (had to copy a large portion of
> the class since the method that actually does the highlighting is private
> along with some classes to get this to run).  If you look at the source it
> should hopefully make sense.
>
>
>         String[] docTexts = null;
>
>         if(fieldName.equals("title")){
>
>             SchemaField keyField = schema.getUniqueKeyField();
>             // I know this field exists and is not multivalued
>             String key = doc.getValues(keyField.getName())[0];
>             // this would be loaded from external store, but below
>             // just appends some information
>             docTexts = doc.getValues(fieldName);
>             if(key != null && key.length() > 0){
>                 for(int x = 0; x < docTexts.length; x++){
>                     docTexts[x] = docTexts[x] + " some added text";
>                 }
>             }
>         }
>
> I have cheated since I know the name of the field that (title) which I am
> doing this for but it would probably be useful to allow this to be set on
> the highlighter class through configuration in solrconfig (I'm not familiar
> at all with doing this and have spent 0 time looking into it).  Once
> configured the if(fieldName.equals("title")) line would be replaced with
> something like if(externalFields.contains(fieldName)){...} or something like
> that.
>
> Thoughts/comments?
>
> On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <so...@ifactory.com> wrote:
>
>> I'd be very interested in this, as well, if you do it before me and are
>> willing to share...
>>
>> A related question I have tried to ask on this list, and have never really
>> gotten a good answer to, is whether it makes sense to just chuck the
>> external storage and treat the lucene index as the primary storage for
>> documents.  I have a feeling the answer is no; perhaps because of increased
>> I/O costs for lucene and solr, but I don't really know.  I've been
>> considering doing some experimentation, but would really love an expert
>> opinion...
>>
>> -Mike
>>
>>
>> On 06/20/2011 08:41 AM, Jamie Johnson wrote:
>>
>>> I am trying to index data where I'm concerned that storing the contents
>>> of a
>>> specific field will be a bit of a hog so we are planning to retrieve this
>>> information as needed for highlighting from an external source.  I am
>>> looking to extend the default solr highlighting capability to work with
>>> information pulled from this external source and it looks like this is
>>> possible by extending DefaultSolrHighlighter (line 418 to pull a
>>> particular
>>> field from external source) for standard highlighting and
>>> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just
>>> hard
>>> code this to say if the field name is a specific value look into the
>>> external source, is this the best way to accomplish this?  Are there any
>>> other extension points to do what I'm suggesting?
>>>
>>>
>>>
>>
>

Re: Extending Solr Highlighter to pull information from external source

Posted by Mike Sokolov <so...@ifactory.com>.

Another option for determining whether to go to external storage would 
be to examine the SchemaField, see if it is stored, and if not, try to 
fetch from a file or whatever.  That way you won't have to configure 
anything.
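
A rough sketch of that fallback, under stated assumptions: the class HighlightTextResolver and its resolve method are hypothetical names, the stored flag stands in for what SchemaField.stored() would report, and the external lookup is passed in as a plain function so the example runs without Solr.

```java
import java.util.function.BiFunction;

class HighlightTextResolver {
    // Uses the stored values when the field is stored and has values;
    // otherwise asks the external lookup for the text (key, fieldName) -> content.
    static String[] resolve(boolean stored, String[] storedValues,
                            String key, String fieldName,
                            BiFunction<String, String, String> externalLookup) {
        if (stored && storedValues != null && storedValues.length > 0) {
            return storedValues;
        }
        String external = externalLookup.apply(key, fieldName);
        // No external content either: return an empty array so callers can skip the field.
        return external == null ? new String[0] : new String[] { external };
    }

    public static void main(String[] args) {
        // Hypothetical lookup that fabricates text from the key, for demonstration only.
        String[] texts = resolve(false, null, "1", "title",
                (k, f) -> "external text for " + k);
        System.out.println(texts[0]);
    }
}
```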

-Mike

On 06/20/2011 09:46 AM, Jamie Johnson wrote:
> In my case chucking the external storage is simply not an option.  
> I'll definitely share anything I find,  the following is a very simple 
> example of adding text to the default solr highlighter (had to copy a 
> large portion of the class since the method that actually does the 
> highlighting is private along with some classes to get this to run).  
> If you look at the source it should hopefully make sense.
>
>
>         String[] docTexts = null;
>
>         if(fieldName.equals("title")){
>
>             SchemaField keyField = schema.getUniqueKeyField();
>             // I know this field exists and is not multivalued
>             String key = doc.getValues(keyField.getName())[0];
>             // this would be loaded from external store, but below
>             // just appends some information
>             docTexts = doc.getValues(fieldName);
>             if(key != null && key.length() > 0){
>                 for(int x = 0; x < docTexts.length; x++){
>                     docTexts[x] = docTexts[x] + " some added text";
>                 }
>             }
>         }
>
> I have cheated since I know the name of the field that (title) which I 
> am doing this for but it would probably be useful to allow this to be 
> set on the highlighter class through configuration in solrconfig (I'm 
> not familiar at all with doing this and have spent 0 time looking into 
> it).  Once configured the if(fieldName.equals("title")) line would be 
> replaced with something like 
> if(externalFields.contains(fieldName)){...} or something like that.
>
> Thoughts/comments?
>
> On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <so...@ifactory.com> wrote:
>
>     I'd be very interested in this, as well, if you do it before me
>     and are willing to share...
>
>     A related question I have tried to ask on this list, and have
>     never really gotten a good answer to, is whether it makes sense to
>     just chuck the external storage and treat the lucene index as the
>     primary storage for documents.  I have a feeling the answer is no;
>     perhaps because of increased I/O costs for lucene and solr, but I
>     don't really know.  I've been considering doing some
>     experimentation, but would really love an expert opinion...
>
>     -Mike
>
>
>     On 06/20/2011 08:41 AM, Jamie Johnson wrote:
>
>         I am trying to index data where I'm concerned that storing the
>         contents of a
>         specific field will be a bit of a hog so we are planning to
>         retrieve this
>         information as needed for highlighting from an external
>         source.  I am
>         looking to extend the default solr highlighting capability to
>         work with
>         information pulled from this external source and it looks like
>         this is
>         possible by extending DefaultSolrHighlighter (line 418 to pull
>         a particular
>         field from external source) for standard highlighting and
>         BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I
>         could just hard
>         code this to say if the field name is a specific value look
>         into the
>         external source, is this the best way to accomplish this?  Are
>         there any
>         other extension points to do what I'm suggesting?
>
>
>

Re: Extending Solr Highlighter to pull information from external source

Posted by Jamie Johnson <je...@gmail.com>.

In my case chucking the external storage is simply not an option.  I'll
definitely share anything I find.  The following is a very simple example of
adding text in the default Solr highlighter (I had to copy a large portion of
the class, along with some supporting classes, to get this to run, since the
method that actually does the highlighting is private).  If you look at the
source it should hopefully make sense.

        String[] docTexts = null;

        if(fieldName.equals("title")){

            SchemaField keyField = schema.getUniqueKeyField();
            // I know this field exists and is not multivalued
            String key = doc.getValues(keyField.getName())[0];
            // this would be loaded from external store, but below
            // just appends some information
            docTexts = doc.getValues(fieldName);
            if(key != null && key.length() > 0){
                for(int x = 0; x < docTexts.length; x++){
                    docTexts[x] = docTexts[x] + " some added text";
                }
            }
        }

I have cheated, since I know the name of the field (title) that I am doing
this for, but it would probably be useful to allow this to be set on the
highlighter class through configuration in solrconfig (I'm not familiar at
all with doing this and have spent no time looking into it).  Once
configured, the if(fieldName.equals("title")) line would be replaced with
something like if(externalFields.contains(fieldName)){...}.
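
As a rough illustration of the externalFields check, the sketch below builds the set from a comma-separated string. This is only an assumption about how the setting might look: the class ExternalFieldConfig, its parse method, and the comma-separated format are all hypothetical, and how the value would actually be read out of solrconfig is not covered here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

class ExternalFieldConfig {
    // Parses a comma-separated list of field names (hypothetical config format)
    // into a set, ignoring surrounding whitespace and empty entries.
    static Set<String> parse(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return Collections.emptySet();
        }
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> externalFields = parse("title, body");
        String fieldName = "title";
        // Replaces the hard-coded fieldName.equals("title") test.
        if (externalFields.contains(fieldName)) {
            System.out.println(fieldName + " would be loaded from the external store");
        }
    }
}
```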

Thoughts/comments?

On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <so...@ifactory.com> wrote:

> I'd be very interested in this, as well, if you do it before me and are
> willing to share...
>
> A related question I have tried to ask on this list, and have never really
> gotten a good answer to, is whether it makes sense to just chuck the
> external storage and treat the lucene index as the primary storage for
> documents.  I have a feeling the answer is no; perhaps because of increased
> I/O costs for lucene and solr, but I don't really know.  I've been
> considering doing some experimentation, but would really love an expert
> opinion...
>
> -Mike
>
>
> On 06/20/2011 08:41 AM, Jamie Johnson wrote:
>
>> I am trying to index data where I'm concerned that storing the contents of
>> a
>> specific field will be a bit of a hog so we are planning to retrieve this
>> information as needed for highlighting from an external source.  I am
>> looking to extend the default solr highlighting capability to work with
>> information pulled from this external source and it looks like this is
>> possible by extending DefaultSolrHighlighter (line 418 to pull a
>> particular
>> field from external source) for standard highlighting and
>> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just
>> hard
>> code this to say if the field name is a specific value look into the
>> external source, is this the best way to accomplish this?  Are there any
>> other extension points to do what I'm suggesting?
>>
>>
>>
>

Re: Extending Solr Highlighter to pull information from external source

Posted by François Schiettecatte <fs...@gmail.com>.

Mike

I would be very interested in the answer to that question too. My hunch is that the answer is no too. I have a few text databases that range from 200MB to about 60GB with which I could run some tests. I will have some downtime in early July and will post results.

From what I can tell the Guardian newspaper is doing just that:

	http://www.guardian.co.uk/open-platform/blog/what-is-powering-the-content-api
	http://www.lucidimagination.com/blog/2010/04/29/for-the-guardian-solr-is-the-new-database/

Cheers

François


On Jun 20, 2011, at 9:05 AM, Mike Sokolov wrote:

> I'd be very interested in this, as well, if you do it before me and are willing to share...
> 
> A related question I have tried to ask on this list, and have never really gotten a good answer to, is whether it makes sense to just chuck the external storage and treat the lucene index as the primary storage for documents.  I have a feeling the answer is no; perhaps because of increased I/O costs for lucene and solr, but I don't really know.  I've been considering doing some experimentation, but would really love an expert opinion...
> 
> -Mike
> 
> On 06/20/2011 08:41 AM, Jamie Johnson wrote:
>> I am trying to index data where I'm concerned that storing the contents of a
>> specific field will be a bit of a hog so we are planning to retrieve this
>> information as needed for highlighting from an external source.  I am
>> looking to extend the default solr highlighting capability to work with
>> information pulled from this external source and it looks like this is
>> possible by extending DefaultSolrHighlighter (line 418 to pull a particular
>> field from external source) for standard highlighting and
>> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just hard
>> code this to say if the field name is a specific value look into the
>> external source, is this the best way to accomplish this?  Are there any
>> other extension points to do what I'm suggesting?
>> 
>>

Re: Extending Solr Highlighter to pull information from external source

Posted by Mike Sokolov <so...@ifactory.com>.

I'd be very interested in this, as well, if you do it before me and are 
willing to share...

A related question I have tried to ask on this list, and have never 
really gotten a good answer to, is whether it makes sense to just chuck 
the external storage and treat the lucene index as the primary storage 
for documents.  I have a feeling the answer is no; perhaps because of 
increased I/O costs for lucene and solr, but I don't really know.  I've 
been considering doing some experimentation, but would really love an 
expert opinion...

-Mike

On 06/20/2011 08:41 AM, Jamie Johnson wrote:
> I am trying to index data where I'm concerned that storing the contents of a
> specific field will be a bit of a hog so we are planning to retrieve this
> information as needed for highlighting from an external source.  I am
> looking to extend the default solr highlighting capability to work with
> information pulled from this external source and it looks like this is
> possible by extending DefaultSolrHighlighter (line 418 to pull a particular
> field from external source) for standard highlighting and
> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just hard
> code this to say if the field name is a specific value look into the
> external source, is this the best way to accomplish this?  Are there any
> other extension points to do what I'm suggesting?
>
>