Posted to solr-user@lucene.apache.org by Scott Yeadon <sc...@anu.edu.au> on 2010/10/01 04:00:21 UTC

PHP Solr API

Hi,

I have inherited an application which uses Solr search and the PHP Solr 
API (http://pecl.php.net/package/solr). While the list of search results 
with appropriate highlighting is all good, when selecting a result that 
navigates to an individual article the users want to have all the hits 
highlighted in the full text.

The problem is that the article text is HTML and Solr appears to strip 
the HTML by default. Neither the highlight snippets nor the "stored" 
version of the text contains any formatting. This means that neither 
requesting a very large snippet and using it as the article text, nor 
using the stored version returned in the response, is satisfactory.

Obtaining offset information from the search and applying the 
highlighting myself within the webapp using the HTML version would be 
fine, but the offsets will be wrong because the tags have been stripped. 
This doesn't seem to be a particularly unusual use case, yet I could not 
find information on how to achieve it. It's likely I'm overlooking 
something simple. Does anyone have any advice on how I might get this 
to work?

Thanks.

Scott.
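
One way to sidestep the offset problem described above is to ignore Solr's 
offsets and re-apply the highlighting in the webapp, matching the query 
terms only inside the text nodes of the original HTML. The following is a 
minimal sketch of that idea, not something taken from the Solr or PHP Solr 
API. It is written in Python using only the standard library; the names 
TermHighlighter and highlight_html are hypothetical, and it assumes the 
webapp already has the raw HTML and a list of plain query terms. Terms that 
are split by an entity or by inline markup will not be matched.

```python
import re
from html.parser import HTMLParser


class TermHighlighter(HTMLParser):
    """Re-emit HTML as parsed, wrapping matched terms in text nodes only."""

    def __init__(self, terms, pre="<em>", post="</em>"):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        ordered = sorted(terms, key=len, reverse=True)  # longest term first
        self.pattern = re.compile(
            r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
            re.IGNORECASE,
        )
        self.pre, self.post = pre, post
        self.out = []
        self.in_raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in ("script", "style"):
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in ("script", "style"):
            self.in_raw = False

    def handle_data(self, data):
        if self.in_raw:
            self.out.append(data)  # never touch script or style content
        else:
            self.out.append(
                self.pattern.sub(lambda m: self.pre + m.group(0) + self.post, data)
            )

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def highlight_html(html, terms):
    """Return html with each term wrapped in <em>...</em>, tags left intact."""
    if not terms:
        return html
    parser = TermHighlighter(terms)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, highlight_html('<p class="solr">Solr strips <b>HTML</b> tags.</p>', 
["solr", "html"]) wraps "Solr" and "HTML" in the visible text but leaves the 
class attribute alone. Matching here is by literal term, so it will not 
reproduce stemming or other analysis that Solr applies at query time.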

Re: PHP Solr API

Posted by Scott Yeadon <sc...@anu.edu.au>.
Thanks, but I still need to "store" the text in any case in order to get 
the highlighted snippets for the search results list. That isn't a 
problem. The issue is how to obtain correct offsets, or some other 
mechanism, for displaying the original HTML text with term highlighting 
when navigating to an individual search result.

Scott.

On 1/10/10 12:53 PM, Neil Lunn wrote:
> On Fri, 2010-10-01 at 12:00 +1000, Scott Yeadon wrote:
>> Hi,
>>
>> The problem is that the article text is HTML and Solr appears to strip
>> the HTML by default.
> I think what you need to look at is how the fields are defined by
> default in your schema. If data sent as HTML is being added to the
> standard html-text type and stored, then the HTML is stripped and the
> words indexed by default. If you want to store the raw HTML, then maybe
> you should be doing that and only indexing the stripped version, not
> storing it.
>


Re: PHP Solr API

Posted by Neil Lunn <ne...@trixan.com>.
On Fri, 2010-10-01 at 12:00 +1000, Scott Yeadon wrote:
> Hi,
> 

> The problem is that the article text is HTML and Solr appears to strip 
> the HTML by default.

I think what you need to look at is how the fields are defined by
default in your schema. If data sent as HTML is being added to the
standard html-text type and stored, then the HTML is stripped and the
words indexed by default. If you want to store the raw HTML, then maybe
you should be doing that and only indexing the stripped version, not
storing it.

-- 


Regards,

Neil Lunn
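
The suggestion above (keep the raw HTML as the stored value and index only 
a stripped version) can also be approximated on the application side when 
building the document to send to Solr. The sketch below is one possible 
reading of that advice, not the thread authors' implementation. The field 
names body_html and body_text and the function build_solr_doc are 
hypothetical, and the schema is assumed to define body_html as stored and 
body_text as indexed. Solr's HTMLStripCharFilterFactory is an alternative 
that does the stripping inside the analysis chain instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Note: this simple version also keeps <script>/<style> content.
        self.parts.append(data)


def build_solr_doc(doc_id, html):
    """Return a dict with the raw HTML and a tag-free copy for indexing."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace left behind by removed markup.
    text = " ".join("".join(extractor.parts).split())
    return {
        "id": doc_id,
        "body_html": html,   # hypothetical stored field: original markup
        "body_text": text,   # hypothetical indexed field: plain text
    }
```

The article page can then render body_html, while searching and snippet 
highlighting run against body_text.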