Posted to solr-user@lucene.apache.org by Jay Hill <ja...@gmail.com> on 2009/08/08 02:23:42 UTC

MoreLikeThis: How to get quality terms from html from content stream?

I'm using the MoreLikeThisHandler with a content stream to get documents
from my index that match content from an html page like this:
http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true

But, not surprisingly, the query generated is meaningless because a lot of
the markup is picked out as terms:
<str name="parsedquery_toString">
body:li body:href  body:div body:class body:a body:script body:type body:js
body:ul body:text body:javascript body:style body:css body:h body:img
body:var body:articl body:ad body:http body:span body:prop
</str>

Does anyone know a way to transform the html so that the content can be
parsed out of the content stream and processed w/o the markup? Or do I need
to write my own HTMLParsingMoreLikeThisHandler?

If I parse the content out to a plain text file and point the stream.url
param to file:///parsedfile.txt it works great.

-Jay
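[Archive editor's note: the file:// workaround above can be scripted. Below is a rough, hedged sketch in Python; the /solr/mlt path and the mlt.fl=body field come from the URLs in this thread, while the helper names (TextExtractor, html_to_text, mlt_url) are illustrative. It strips markup with the stdlib HTMLParser, skipping <script> and <style> bodies so terms like body:javascript and body:css don't leak into the query, writes the text to a temp file, and builds the MLT request against its file:// URL.]

```python
import html.parser
import tempfile
import urllib.parse


class TextExtractor(html.parser.HTMLParser):
    """Collect character data only, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML document, whitespace-normalized."""
    extractor = TextExtractor()
    extractor.feed(markup)
    return " ".join("".join(extractor.parts).split())


def mlt_url(text, solr="http://localhost:8080/solr"):
    """Write text to a temp file and build the MoreLikeThis request for it."""
    tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    tmp.write(text)
    tmp.close()
    params = urllib.parse.urlencode({
        "stream.url": "file://" + tmp.name,
        "mlt.fl": "body",
        "rows": 4,
    })
    return "%s/mlt?%s" % (solr, params)
```

[Fetching the page with urllib, running it through html_to_text, and requesting mlt_url(...) should then behave like the hand-parsed file:///parsedfile.txt case Jay describes.]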

Re: MoreLikeThis: How to get quality terms from html from content stream?

Posted by Grant Ingersoll <gs...@apache.org>.
Right, a SearchComponent wrapper around some of the Solr Cell  
capabilities could make this so.

On Aug 9, 2009, at 11:21 AM, Jay Hill wrote:

> Solr Cell definitely sounds like it has a place here. But wouldn't it be
> needed as an extracting component earlier in the process for the
> MoreLikeThisHandler? The MLT Handler works great when it's directed to a
> content stream of plain text. If we could just use Solr Cell to identify the
> file type and do the content extraction earlier in the stream, that would do
> the trick, I think. Then whether the URL pointed to HTML, a PDF, or whatever,
> MLT would be receiving a stream of extracted content.
>
> -Jay
>
>
> On Sun, Aug 9, 2009 at 7:17 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>> It's starting to sound like Solr Cell needs a SearchComponent as well, that
>> can come before the QueryComponent and can be used to map into the other
>> components.  Essentially, take the functionality of the extractOnly option
>> and have it feed other SearchComponents.
>>
>>
>>
>> On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:
>>
>>
>>> On Aug 7, 2009, at 5:23pm, Jay Hill wrote:
>>>
>>>> I'm using the MoreLikeThisHandler with a content stream to get documents
>>>> from my index that match content from an html page like this:
>>>>
>>>> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>>>>
>>>> But, not surprisingly, the query generated is meaningless because a lot
>>>> of the markup is picked out as terms:
>>>> <str name="parsedquery_toString">
>>>> body:li body:href body:div body:class body:a body:script body:type
>>>> body:js body:ul body:text body:javascript body:style body:css body:h
>>>> body:img body:var body:articl body:ad body:http body:span body:prop
>>>> </str>
>>>>
>>>> Does anyone know a way to transform the html so that the content can be
>>>> parsed out of the content stream and processed w/o the markup? Or do I
>>>> need to write my own HTMLParsingMoreLikeThisHandler?
>>>>
>>>
>>> You'd want to parse the HTML to extract only text first, and use that for
>>> your index data.
>>>
>>> Both the Nutch and Tika OSS projects have examples of using HTML parsers
>>> (based on TagSoup or CyberNeko) to generate content suitable for indexing.
>>>
>>> -- Ken
>>>
>>>> If I parse the content out to a plain text file and point the stream.url
>>>> param to file:///parsedfile.txt it works great.
>>>>
>>>> -Jay
>>>>
>>>
>>> --------------------------
>>> Ken Krugler
>>> TransPac Software, Inc.
>>> <http://www.transpac.com>
>>> +1 530-210-6378
>>>
>>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: MoreLikeThis: How to get quality terms from html from content stream?

Posted by Jay Hill <ja...@gmail.com>.
Solr Cell definitely sounds like it has a place here. But wouldn't it be
needed as an extracting component earlier in the process for the
MoreLikeThisHandler? The MLT Handler works great when it's directed to a
content stream of plain text. If we could just use Solr Cell to identify the
file type and do the content extraction earlier in the stream, that would do
the trick, I think. Then whether the URL pointed to HTML, a PDF, or whatever,
MLT would be receiving a stream of extracted content.

-Jay


On Sun, Aug 9, 2009 at 7:17 AM, Grant Ingersoll <gs...@apache.org> wrote:

> It's starting to sound like Solr Cell needs a SearchComponent as well, that
> can come before the QueryComponent and can be used to map into the other
> components.  Essentially, take the functionality of the extractOnly option
> and have it feed other SearchComponents.
>
>
>
> On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:
>
>
>> On Aug 7, 2009, at 5:23pm, Jay Hill wrote:
>>
>>> I'm using the MoreLikeThisHandler with a content stream to get documents
>>> from my index that match content from an html page like this:
>>>
>>> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>>>
>>> But, not surprisingly, the query generated is meaningless because a lot
>>> of the markup is picked out as terms:
>>> <str name="parsedquery_toString">
>>> body:li body:href body:div body:class body:a body:script body:type
>>> body:js body:ul body:text body:javascript body:style body:css body:h
>>> body:img body:var body:articl body:ad body:http body:span body:prop
>>> </str>
>>>
>>> Does anyone know a way to transform the html so that the content can be
>>> parsed out of the content stream and processed w/o the markup? Or do I
>>> need to write my own HTMLParsingMoreLikeThisHandler?
>>>
>>
>> You'd want to parse the HTML to extract only text first, and use that for
>> your index data.
>>
>> Both the Nutch and Tika OSS projects have examples of using HTML parsers
>> (based on TagSoup or CyberNeko) to generate content suitable for indexing.
>>
>> -- Ken
>>
>>> If I parse the content out to a plain text file and point the stream.url
>>> param to file:///parsedfile.txt it works great.
>>>
>>> -Jay
>>>
>>
>> --------------------------
>> Ken Krugler
>> TransPac Software, Inc.
>> <http://www.transpac.com>
>> +1 530-210-6378
>>
>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: MoreLikeThis: How to get quality terms from html from content stream?

Posted by Grant Ingersoll <gs...@apache.org>.
It's starting to sound like Solr Cell needs a SearchComponent as well,
that can come before the QueryComponent and can be used to map into
the other components.  Essentially, take the functionality of the
extractOnly option and have it feed other SearchComponents.
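[Archive editor's note: until such a component exists, a client can chain the two handlers by hand: ask Solr Cell for extraction only, then feed the result to MLT. The sketch below only builds the two requests; it assumes the stock handler paths (/update/extract, /mlt), that remote streaming is enabled, and it deliberately leaves parsing of the extractOnly response to the caller, since its shape depends on the wt parameter.]

```python
import urllib.parse

# Base URL taken from the examples earlier in this thread.
SOLR = "http://localhost:8080/solr"


def extract_only_url(doc_url):
    """Build a Solr Cell request that extracts a remote document's text
    without indexing it (the extractOnly=true mode Grant mentions)."""
    params = urllib.parse.urlencode({
        "extractOnly": "true",
        "stream.url": doc_url,
    })
    return "%s/update/extract?%s" % (SOLR, params)


def mlt_request(extracted_text):
    """Build the MoreLikeThis request; the extracted text travels as the
    POST body (it could equally be sent as a stream.body parameter)."""
    params = urllib.parse.urlencode({"mlt.fl": "body", "rows": 4})
    url = "%s/mlt?%s" % (SOLR, params)
    return url, extracted_text.encode("utf-8")
```

[The flow would be: GET extract_only_url(page), pull the extracted text out of the response, then POST it to the MLT URL. Whether the page is HTML, a PDF, or whatever, MLT only ever sees extracted content, which is the behavior Jay describes wanting.]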


On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:

>
> On Aug 7, 2009, at 5:23pm, Jay Hill wrote:
>
>> I'm using the MoreLikeThisHandler with a content stream to get documents
>> from my index that match content from an html page like this:
>> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>>
>> But, not surprisingly, the query generated is meaningless because a lot of
>> the markup is picked out as terms:
>> <str name="parsedquery_toString">
>> body:li body:href body:div body:class body:a body:script body:type body:js
>> body:ul body:text body:javascript body:style body:css body:h body:img
>> body:var body:articl body:ad body:http body:span body:prop
>> </str>
>>
>> Does anyone know a way to transform the html so that the content can be
>> parsed out of the content stream and processed w/o the markup? Or do I need
>> to write my own HTMLParsingMoreLikeThisHandler?
>
> You'd want to parse the HTML to extract only text first, and use that for
> your index data.
>
> Both the Nutch and Tika OSS projects have examples of using HTML parsers
> (based on TagSoup or CyberNeko) to generate content suitable for indexing.
>
> -- Ken
>
>> If I parse the content out to a plain text file and point the stream.url
>> param to file:///parsedfile.txt it works great.
>>
>> -Jay
>
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: MoreLikeThis: How to get quality terms from html from content stream?

Posted by Ken Krugler <kk...@transpac.com>.
On Aug 7, 2009, at 5:23pm, Jay Hill wrote:

> I'm using the MoreLikeThisHandler with a content stream to get documents
> from my index that match content from an html page like this:
> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>
> But, not surprisingly, the query generated is meaningless because a lot of
> the markup is picked out as terms:
> <str name="parsedquery_toString">
> body:li body:href body:div body:class body:a body:script body:type body:js
> body:ul body:text body:javascript body:style body:css body:h body:img
> body:var body:articl body:ad body:http body:span body:prop
> </str>
>
> Does anyone know a way to transform the html so that the content can be
> parsed out of the content stream and processed w/o the markup? Or do I
> need to write my own HTMLParsingMoreLikeThisHandler?

You'd want to parse the HTML to extract only text first, and use that  
for your index data.

Both the Nutch and Tika OSS projects have examples of using HTML  
parsers (based on TagSoup or CyberNeko) to generate content suitable  
for indexing.

-- Ken

> If I parse the content out to a plain text file and point the stream.url
> param to file:///parsedfile.txt it works great.
>
> -Jay

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378