Posted to solr-user@lucene.apache.org by Andrew Cogan <ac...@wordsearchbible.com> on 2010/08/24 16:08:30 UTC

Restricting HTML search?

I'm quite new to SOLR and wondering if the following is possible: in
addition to normal full text search, my users want to have the option to
search only HTML heading innertext, i.e. content inside of <H1>, <H2>, or
<H3> tags.

Thank you,

Andy Cogan

Re: Restricting HTML search?

Posted by Lance Norskog <go...@gmail.com>.

Cool! I did not know that Tika had such a thorough and careful HTML parser.

On Wed, Aug 25, 2010 at 7:49 PM, Ken Krugler
<kk...@transpac.com> wrote:
> Actually TagSoup's reason for existence is to clean up all of the messy HTML
> that's out in the wild.
>
> Tika's HTML parser wraps this, and uses it to generate the stream of SAX
> events that it then consumes and turns into a normalized XHTML 1.0-compliant
> data stream.
>
> -- Ken
>
> On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:
>
>> This assumes that the HTML is good quality. I don't know exactly what
>> your use case is. If you're crawling the web you will find some very
>> screwed-up HTML.
>>
>> On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
>> <kk...@transpac.com> wrote:
>>>
>>> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>>>
>>>> Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
>>>> I guess it all depends on the "quality" of the source document.
>>>
>>> If you're processing HTML then you definitely want to use something like
>>> NekoHTML or TagSoup.
>>>
>>> Note that Tika uses TagSoup and makes it easy to do special processing of
>>> specific elements - you give it a content handler that gets fed a stream
>>> of cleaned-up HTML elements.
>>>
>>> -- Ken
>>>
>>>> On Aug 25, 2010, at 02:09, Lance Norskog wrote:
>>>>
>>>>> I would do this with regular expressions. There is a Pattern Analyzer
>>>>> and a Tokenizer which do regular expression-based text chopping. (I'm
>>>>> not sure how to make them do what you want). A more precise tool is
>>>>> the RegexTransformer in the DataImportHandler.
>>>>>
>>>>> Lance
>>>>>
>>>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>>>> <ac...@wordsearchbible.com> wrote:
>>>>>>
>>>>>> I'm quite new to SOLR and wondering if the following is possible: in
>>>>>> addition to normal full text search, my users want to have the option
>>>>>> to search only HTML heading innertext, i.e. content inside of <H1>,
>>>>>> <H2>, or <H3> tags.

-- 
Lance Norskog
goksron@gmail.com

Re: Restricting HTML search?

Posted by Ken Krugler <kk...@transpac.com>.

Actually TagSoup's reason for existence is to clean up all of the messy HTML
that's out in the wild.

Tika's HTML parser wraps this, and uses it to generate the stream of SAX
events that it then consumes and turns into a normalized XHTML 1.0-compliant
data stream.

-- Ken

On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:

> This assumes that the HTML is good quality. I don't know exactly what
> your use case is. If you're crawling the web you will find some very
> screwed-up HTML.
>
> On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>>
>> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>>
>>> Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
>>> I guess it all depends on the "quality" of the source document.
>>
>> If you're processing HTML then you definitely want to use something like
>> NekoHTML or TagSoup.
>>
>> Note that Tika uses TagSoup and makes it easy to do special processing of
>> specific elements - you give it a content handler that gets fed a stream of
>> cleaned-up HTML elements.
>>
>> -- Ken
>>
>>> On Aug 25, 2010, at 02:09, Lance Norskog wrote:
>>>
>>>> I would do this with regular expressions. There is a Pattern Analyzer
>>>> and a Tokenizer which do regular expression-based text chopping. (I'm
>>>> not sure how to make them do what you want). A more precise tool is
>>>> the RegexTransformer in the DataImportHandler.
>>>>
>>>> Lance
>>>>
>>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>>> <ac...@wordsearchbible.com> wrote:
>>>>>
>>>>> I'm quite new to SOLR and wondering if the following is possible: in
>>>>> addition to normal full text search, my users want to have the option to
>>>>> search only HTML heading innertext, i.e. content inside of <H1>, <H2>,
>>>>> or <H3> tags.

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Restricting HTML search?

Posted by Lance Norskog <go...@gmail.com>.

This assumes that the HTML is good quality. I don't know exactly what
your use case is. If you're crawling the web you will find some very
screwed-up HTML.

On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
<kk...@transpac.com> wrote:
>
> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>
>> Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
>> I guess it all depends on the "quality" of the source document.
>
> If you're processing HTML then you definitely want to use something like
> NekoHTML or TagSoup.
>
> Note that Tika uses TagSoup and makes it easy to do special processing of
> specific elements - you give it a content handler that gets fed a stream of
> cleaned-up HTML elements.
>
> -- Ken
>
>> On Aug 25, 2010, at 02:09, Lance Norskog wrote:
>>
>>> I would do this with regular expressions. There is a Pattern Analyzer
>>> and a Tokenizer which do regular expression-based text chopping. (I'm
>>> not sure how to make them do what you want). A more precise tool is
>>> the RegexTransformer in the DataImportHandler.
>>>
>>> Lance
>>>
>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>> <ac...@wordsearchbible.com> wrote:
>>>>
>>>> I'm quite new to SOLR and wondering if the following is possible: in
>>>> addition to normal full text search, my users want to have the option to
>>>> search only HTML heading innertext, i.e. content inside of <H1>, <H2>,
>>>> or <H3> tags.

-- 
Lance Norskog
goksron@gmail.com

Re: Restricting HTML search?

Posted by Ken Krugler <kk...@transpac.com>.

On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

> Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
> I guess it all depends on the "quality" of the source document.

If you're processing HTML then you definitely want to use something  
like NekoHTML or TagSoup.

Note that Tika uses TagSoup and makes it easy to do special processing  
of specific elements - you give it a content handler that gets fed a  
stream of cleaned-up HTML elements.
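
To make the content-handler idea concrete, here is a minimal sketch. It is
not Tika or TagSoup code (those are Java libraries); it uses Python's
standard-library html.parser as a stand-in, since it also feeds start-tag,
end-tag, and text events to a handler and tolerates sloppy markup. The names
HeadingTextCollector and extract_heading_text are hypothetical. The collected
strings could then be indexed in a separate field for heading-only search.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingTextCollector(HTMLParser):
    """Collects the text found inside <h1>, <h2>, and <h3> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []      # finished heading strings, in document order
        self._open_tag = None   # heading tag currently being read, if any
        self._parts = []        # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <H1> arrives as "h1".
        if tag in HEADING_TAGS:
            self._flush()       # a new heading implicitly ends an unclosed one
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a heading; nested tags such as <em>
        # are ignored but their text is still collected.
        if self._open_tag is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()           # emit a heading left open at end of input

    def _flush(self):
        if self._open_tag is not None:
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append(text)
        self._open_tag = None
        self._parts = []


def extract_heading_text(html):
    """Return the heading texts of an HTML string, in document order."""
    collector = HeadingTextCollector()
    collector.feed(html)
    collector.close()
    return collector.headings
```

Because the handler only reacts to events, an unclosed heading does not abort
the run: the sketch closes it when the next heading starts or the input ends.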

-- Ken

> On Aug 25, 2010, at 02:09, Lance Norskog wrote:
>
>> I would do this with regular expressions. There is a Pattern Analyzer
>> and a Tokenizer which do regular expression-based text chopping. (I'm
>> not sure how to make them do what you want). A more precise tool is
>> the RegexTransformer in the DataImportHandler.
>>
>> Lance
>>
>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>> <ac...@wordsearchbible.com> wrote:
>>> I'm quite new to SOLR and wondering if the following is possible: in
>>> addition to normal full text search, my users want to have the option to
>>> search only HTML heading innertext, i.e. content inside of <H1>, <H2>, or
>>> <H3> tags.
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Restricting HTML search?

Posted by Paul Libbrecht <pa...@activemath.org>.

Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
I guess it all depends on the "quality" of the source document.
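
A rough sketch of the XPath idea, assuming the markup is already well-formed
XHTML with lower-case tags and no XML namespace. It does not use NekoHTML (a
Java library); it uses Python's standard-library ElementTree, which supports
only a small XPath subset. The function name headings_by_xpath is
hypothetical.

```python
import xml.etree.ElementTree as ET


def headings_by_xpath(xhtml):
    """Map each heading level to the texts of its elements.

    Assumes well-formed, namespace-free XHTML; ET.fromstring raises
    ET.ParseError on anything else.
    """
    root = ET.fromstring(xhtml)
    result = {}
    for tag in ("h1", "h2", "h3"):
        # ".//h1" selects every descendant <h1> of the root element.
        elements = root.findall(".//" + tag)
        # itertext() also yields the text of nested elements such as <em>.
        result[tag] = [" ".join("".join(e.itertext()).split()) for e in elements]
    return result
```

This shows the "quality" dependence directly: an unclosed tag makes the
strict XML parser raise ParseError, which is why a clean-up step has to come
before the XPath query.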

paul


On Aug 25, 2010, at 02:09, Lance Norskog wrote:

> I would do this with regular expressions. There is a Pattern Analyzer
> and a Tokenizer which do regular expression-based text chopping. (I'm
> not sure how to make them do what you want). A more precise tool is
> the RegexTransformer in the DataImportHandler.
>
> Lance
>
> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
> <ac...@wordsearchbible.com> wrote:
>> I'm quite new to SOLR and wondering if the following is possible: in
>> addition to normal full text search, my users want to have the option to
>> search only HTML heading innertext, i.e. content inside of <H1>, <H2>, or
>> <H3> tags.

Re: Restricting HTML search?

Posted by Lance Norskog <go...@gmail.com>.

I would do this with regular expressions. There is a Pattern Analyzer
and a Tokenizer which do regular expression-based text chopping. (I'm
not sure how to make them do what you want). A more precise tool is
the RegexTransformer in the DataImportHandler.
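
As an illustration of the regular-expression approach only, here is a small
sketch in plain Python. It is not a RegexTransformer or DataImportHandler
configuration, and the pattern and the name heading_text_via_regex are
assumptions made for this example.

```python
import re

# Matches <h1>..</h1>, <h2>..</h2>, or <h3>..</h3>, case-insensitively.
# The backreference \1 requires the closing tag to have the same level.
HEADING_RE = re.compile(r"<h([1-3])\b[^>]*>(.*?)</h\1\s*>",
                        re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def heading_text_via_regex(html):
    """Return the text of each properly closed h1/h2/h3 element."""
    headings = []
    for match in HEADING_RE.finditer(html):
        inner = TAG_RE.sub("", match.group(2))    # drop nested tags like <em>
        headings.append(" ".join(inner.split()))  # normalize whitespace
    return headings
```

Note that a heading without a closing tag is silently skipped, and HTML
entities are left undecoded, so this only suits fairly regular markup.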

Lance

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
<ac...@wordsearchbible.com> wrote:
> I'm quite new to SOLR and wondering if the following is possible: in
> addition to normal full text search, my users want to have the option to
> search only HTML heading innertext, i.e. content inside of <H1>, <H2>, or
> <H3> tags.
>
>
>
> Thank you,
>
> Andy Cogan

-- 
Lance Norskog
goksron@gmail.com