Posted to solr-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/08/10 15:15:34 UTC

Re: MoreLikeThis: How to get quality terms from html from content stream?

Right, a SearchComponent wrapper around some of the Solr Cell  
capabilities could make this so.
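
If that were built, one hedged guess at how it might be wired up in solrconfig.xml — this uses Solr's real first-components mechanism, but the component name and class are made up, since no such extracting SearchComponent exists yet:

```xml
<!-- Hypothetical sketch: ExtractingSearchComponent is NOT a real Solr class.
     It only illustrates where a Solr Cell extraction step would slot in,
     running ahead of the standard query component. -->
<searchComponent name="extract"
                 class="org.apache.solr.handler.extraction.ExtractingSearchComponent"/>

<requestHandler name="/mlt" class="org.apache.solr.handler.component.SearchHandler">
  <arr name="first-components">
    <str>extract</str>  <!-- extract text from the content stream first -->
  </arr>
</requestHandler>
```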

On Aug 9, 2009, at 11:21 AM, Jay Hill wrote:

> Solr Cell definitely sounds like it has a place here. But wouldn't it
> be needed as an extraction component earlier in the process, before the
> MoreLikeThisHandler? The MLT handler works great when it's directed to
> a content stream of plain text. If we could just use Solr Cell to
> identify the file type and do the content extraction earlier in the
> stream, that would do the trick, I think. Then whether the URL pointed
> to HTML, a PDF, or whatever, MLT would be receiving a stream of
> extracted content.
>
> -Jay
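
As a rough illustration of the flow Jay describes — hand MLT already-extracted plain text rather than raw HTML — here is a minimal Python sketch that builds such a request using the stream.body parameter instead of stream.url (the Solr base URL and field name are assumptions):

```python
from urllib.parse import urlencode

def build_mlt_request(solr_base, text, field="body", rows=4):
    """Build a MoreLikeThisHandler GET request that passes already-extracted
    plain text via stream.body, so MLT never sees raw HTML markup."""
    params = urlencode({
        "stream.body": text,   # extracted plain text, not a URL to raw HTML
        "mlt.fl": field,       # field(s) to pick "interesting terms" from
        "rows": rows,
        "debugQuery": "true",
    })
    return f"{solr_base}/mlt?{params}"

url = build_mlt_request("http://localhost:8080/solr", "giants win the pennant")
# In real use: urllib.request.urlopen(url) and read the MLT response.
print(url)
```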
>
>
> On Sun, Aug 9, 2009 at 7:17 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>> It's starting to sound like Solr Cell needs a SearchComponent as well,
>> one that can come before the QueryComponent and can be used to map into
>> the other components.  Essentially, take the functionality of the
>> extractOnly option and have it feed the other SearchComponents.
>>
>>
>>
>> On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:
>>
>>
>>> On Aug 7, 2009, at 5:23pm, Jay Hill wrote:
>>>
>>>> I'm using the MoreLikeThisHandler with a content stream to get
>>>> documents from my index that match content from an HTML page like
>>>> this:
>>>>
>>>> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>>>>
>>>> But, not surprisingly, the query generated is meaningless because a
>>>> lot of the markup is picked out as terms:
>>>>
>>>> <str name="parsedquery_toString">
>>>> body:li body:href body:div body:class body:a body:script body:type
>>>> body:js body:ul body:text body:javascript body:style body:css
>>>> body:h body:img body:var body:articl body:ad body:http body:span
>>>> body:prop
>>>> </str>
>>>>
>>>> Does anyone know a way to transform the HTML so that the content can
>>>> be parsed out of the content stream and processed w/o the markup? Or
>>>> do I need to write my own HTMLParsingMoreLikeThisHandler?
>>>>
>>>
>>> You'd want to parse the HTML to extract only text first, and use  
>>> that for
>>> your index data.
>>>
>>> Both the Nutch and Tika OSS projects have examples of using HTML  
>>> parsers
>>> (based on TagSoup or CyberNeko) to generate content suitable for  
>>> indexing.
>>>
>>> -- Ken
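
Ken's approach — strip the markup before the text reaches the index or the MLT handler — would use Tika, TagSoup, or CyberNeko in a Java stack; as a toy illustration of the idea, a tag-and-script stripper can be sketched with Python's standard-library html.parser:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data, skipping <script> and <style> contents,
    so tag names and JS/CSS never show up as index terms."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

print(html_to_text(
    '<ul class="js"><li><a href="/x">story</a></li></ul>'
    '<script type="text/javascript">var ad = 1;</script>'
))
# prints: story
```

The resulting plain text is what you would then feed to MLT (e.g. via stream.body or a temp file), instead of pointing stream.url at the raw page.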
>>>
>>> If I parse the content out to a plain text file and point the  
>>> stream.url
>>>> param to file:///parsedfile.txt it works great.
>>>>
>>>> -Jay
>>>>
>>>
>>> --------------------------
>>> Ken Krugler
>>> TransPac Software, Inc.
>>> <http://www.transpac.com>
>>> +1 530-210-6378
>>>
>>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search