Posted to solr-user@lucene.apache.org by Andrew Clegg <an...@gmail.com> on 2010/06/06 21:16:04 UTC

Indexing link targets in HTML fragments

Hi Solr gurus,

I'm wondering if there is an easy way to keep the targets of hyperlinks from
a field which may contain HTML fragments, while stripping the HTML.

e.g. if I had a field that looked like this:

"This is the entire content of my field, but  http://example.com/ some of
the words  are a hyperlink."

Then I'd like to keep "http://example.com/" as a single token (along with
all of the actual words) but not the "a" and "href", giving me:

"This is the entire content of my field but http://example.com/ some of the
words are a hyperlink"

I'm thinking that since we're dealing with individual fragments rather than
entire HTML pages, Tika/SolrCell may be poorly suited and/or too heavyweight
-- but please correct me if I'm wrong.

Maybe something using regular expressions? Does anyone have a code snippet
they could share?

Many thanks,

Andrew.

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p874547.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing link targets in HTML fragments

Posted by Andrew Clegg <an...@gmail.com>.

findbestopensource wrote:
> 
> Could you tell us the schema you use for indexing? In my opinion,
> StandardAnalyzer or the Snowball analyzer will work best. They will not break
> the URLs. Add href and other related HTML tag and attribute names as stop
> words and they will be removed while indexing.
> 

This project's still in the planning stages -- I haven't designed the
pipeline yet.

But you're right, maybe starting with everything and just stopping out the
tag and attribute names is the most fail-safe approach.

Then at least if I get something wrong I won't miss anything. Worst case
scenario, I just end up with some extra terms in the index.

Thanks,

Andrew.


Re: Indexing link targets in HTML fragments

Posted by findbestopensource <fi...@gmail.com>.
Could you tell us the schema you use for indexing? In my opinion,
StandardAnalyzer or the Snowball analyzer will work best. They will not break
the URLs. Add href and other related HTML tag and attribute names as stop
words and they will be removed while indexing.
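
To get a feel for what the stop-word approach would leave in the index, here is a rough Python simulation. It is not Solr's StandardAnalyzer or Snowball analyzer; the token pattern, the stop list and the function name are all assumptions made up for illustration. It keeps http(s) URLs whole, lowercases everything, and drops the markup names.

```python
import re

# Hypothetical stop list holding markup names only. Note that "a" is also an
# English word, so stopping it removes the article as well as the tag name.
MARKUP_STOP_WORDS = {"a", "href"}

# Assumed token pattern: keep http(s) URLs whole, otherwise take runs of
# letters, digits and apostrophes. A real analyzer may split differently.
TOKEN_RE = re.compile(r"https?://[^\s\"'<>]+|[A-Za-z0-9']+")


def tokens_with_markup_stopped(fragment: str) -> list[str]:
    """Tokenize the raw fragment and drop the markup stop words."""
    tokens = [t.lower() for t in TOKEN_RE.findall(fragment)]
    return [t for t in tokens if t not in MARKUP_STOP_WORDS]


if __name__ == "__main__":
    sample = 'but <a href="http://example.com/">some of the words</a> are a hyperlink.'
    print(tokens_with_markup_stopped(sample))
```

One thing the sketch makes visible: because "a" is on the stop list, the article in "are a hyperlink" disappears along with the tag name. Whether that matters depends on the queries you expect.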

Regards
Aditya
www.findbestopensource.com



Re: Indexing link targets in HTML fragments

Posted by Andrew Clegg <an...@gmail.com>.

Lance Norskog-2 wrote:
> 
> The PatternReplace and HTMLStrip tokenizers might be the right bet.
> The easiest way to go about this is to make a bunch of text fields
> with different analysis stacks and investigate them in the Schema
> Browser. You can paste an HTML document into the text box and see
> exactly how the words & markup get torn apart.
> 

Thanks Lance, I'll experiment.

For reference, for anyone else who comes across this thread -- the html in
my original post might have got munged on the way into or out of the list
server. It was supposed to look like this:

This is the entire content of my field, but [a
href="http://example.com/"]some of the words[/a] are a hyperlink.

(but with real html tags instead of the square brackets)

and I am just trying to extract the words and the link target but lose the
rest of the markup.

Cheers,

Andrew.


Re: Indexing link targets in HTML fragments

Posted by Lance Norskog <go...@gmail.com>.
The PatternReplace and HTMLStrip tokenizers might be the right bet.
The easiest way to go about this is to make a bunch of text fields
with different analysis stacks and investigate them in the Schema
Browser. You can paste an HTML document into the text box and see
exactly how the words & markup get torn apart.
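
The same two-stage idea (a pattern replace to pull out the link target, then a strip of whatever markup is left) can be prototyped outside Solr before committing to an analysis stack. The following is a minimal Python sketch, not a Solr configuration. The function name and both regular expressions are assumptions for illustration, and it only handles quoted href values with no nested anchors.

```python
import html
import re

# Stage 1: rewrite <a ... href="URL" ...>text</a> as "URL text".
# Assumes single- or double-quoted href values and no nested anchors.
ANCHOR_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a\s*>""",
    re.IGNORECASE | re.DOTALL,
)

# Stage 2: drop any tags that remain after stage 1.
TAG_RE = re.compile(r"<[^>]+>")


def keep_link_targets(fragment: str) -> str:
    """Strip markup from an HTML fragment but keep each link target as text."""
    text = ANCHOR_RE.sub(lambda m: f" {m.group(2)} {m.group(3)} ", fragment)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)      # turn entities such as &amp; back into characters
    return " ".join(text.split())   # collapse runs of whitespace


if __name__ == "__main__":
    sample = ('This is the entire content of my field, but <a\n'
              'href="http://example.com/">some of the words</a> are a hyperlink.')
    print(keep_link_targets(sample))
```

The output still has punctuation in it; keeping the URL as a single token afterwards is up to whichever tokenizer the field uses. Regular expressions are fragile on malformed HTML, so for messier input a real parser is probably the safer choice.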



-- 
Lance Norskog
goksron@gmail.com