Posted to user@lucenenet.apache.org by Hans Merkl <hm...@hmerkl.com> on 2010/03/30 23:05:24 UTC

Are there any analyzers for HTML or RTF files?

Hi,
I would like to index formatted files such as HTML or RTF in addition to plain
text. My understanding is that, to get the highlighting right, I need to
feed the formatted text into Lucene and strip out the HTML or RTF tags with
an analyzer.
Does anybody know if there are analyzers available that can strip out those
tags?

Thanks

Hans

Re: Are there any analyzers for HTML or RTF files?

Posted by pa...@gmail.com.
I'd second this. IFilter is the way to go. We use it very effectively to extract content from doc, xls, pdf and xml files, which we then index.
Note that with IFilter you will probably need to make your indexing multithreaded. For some files (especially complex xls files), extracting content with IFilter simply hangs the thread, so you need to set a maximum indexing time per file and kill the thread if that timeout is exceeded.
Also note that IFilters differ between x86 and x64 architectures, and sometimes between versions of Windows. You should therefore double-check that the IFilters work on your staging and production servers, not only on your development machine.

As Digy noted, you should probably use some sort of .NET-COM compatibility layer to avoid dealing with COM directly as much as possible.
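
The per-file time limit described above can be sketched roughly as follows. This is not IFilter or Lucene.NET code. It is a minimal Python illustration that assumes the extraction step can be run as an external command, so a child process is killed in place of the hung thread mentioned above. The function name `extract_text_with_timeout` and the `extractor_cmd` argument are made up for this example.

```python
import subprocess
import sys


def extract_text_with_timeout(extractor_cmd, path, timeout_seconds):
    """Run an external extractor on `path` and return the text it prints.

    Returns None if the extractor exceeds `timeout_seconds` or exits with
    an error, so the caller can skip the file and carry on indexing.
    `extractor_cmd` is a hypothetical command line (list of strings) that
    takes the file path as its last argument.
    """
    try:
        result = subprocess.run(
            [*extractor_cmd, path],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in "extractor" that just echoes the file name it was given.
    fake_extractor = [sys.executable, "-c",
                      "import sys; print('text from ' + sys.argv[1])"]
    print(extract_text_with_timeout(fake_extractor, "report.xls", 10.0))
```

Running the extraction in a separate process rather than a thread means a hang inside the filter cannot take the indexer down with it, at the cost of starting a process per file.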





Re: Are there any analyzers for HTML or RTF files?

Posted by Hans Merkl <hm...@hmerkl.com>.
DIGY,

Thanks for the pointer. I am still learning about highlighting,
and FastVectorHighlighter looks interesting.

I have integrated Lucene.NET successfully. Now I have to fine-tune it. The
capabilities of Lucene.NET are amazing, and also sometimes a bit
overwhelming.

Cheers

Hans



-- 
Hans Merkl
Right On Point, LLC
215 Victor Parkway, Suite E
Annapolis, MD 21403

Phone: (443) 951-4324
E-Mail: hmerkl@rightonpoint.us

RE: Are there any analyzers for HTML or RTF files?

Posted by Digy <di...@gmail.com>.
You can use FastVectorHighlighter from contrib on fields created with
	"Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS"
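
To illustrate why positions and offsets matter here: with that term vector setting, the index keeps, for each term, where it occurs in the field text, and a highlighter can use those character offsets instead of re-analyzing the text. The following is a toy Python sketch of that idea only. It is not Lucene.NET code, and `build_term_vector` and `highlight` are names invented for this example.

```python
import re
from collections import defaultdict


def build_term_vector(text):
    """Map each lowercased term to a list of (position, start, end).

    `position` is the token index; `start` and `end` are character
    offsets into `text` (end is exclusive).
    """
    vector = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        vector[term].append((position, match.start(), match.end()))
    return dict(vector)


def highlight(text, vector, term, pre="<b>", post="</b>"):
    """Wrap every occurrence of `term` using only the stored offsets."""
    pieces = []
    cursor = 0
    for _position, start, end in vector.get(term.lower(), []):
        pieces.append(text[cursor:start])
        pieces.append(pre + text[start:end] + post)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    stored = "Lucene indexes text; lucene highlights text."
    print(highlight(stored, build_term_vector(stored), "lucene"))
```

Note that the offsets refer to the text that was analyzed, which is why the stored field has to contain that same text.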

It can highlight the search results very well. (It can also highlight the
stored doc in text form.)
But as you can see, there is no way to highlight the original document
(unless you want to deal with all sorts of file formats and write some
highlighter code for each of them).
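
As a rough idea of what such format-specific code could look like for HTML only, the Python sketch below strips tags while remembering where each remaining character came from, so a hit found in the plain text can be mapped back to the original markup. It is an illustration under simplifying assumptions: a naive regular expression stands in for a real HTML parser, entities, comments and scripts are ignored, and the function names are invented for this example.

```python
import re

# Naive tag matcher; a real implementation would use a proper HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags_with_offsets(html):
    """Return (plain_text, offsets).

    offsets[i] is the index in `html` of the character plain_text[i].
    """
    chars = []
    offsets = []
    cursor = 0
    for tag in TAG_RE.finditer(html):
        # Copy the text between the previous tag and this one.
        for i in range(cursor, tag.start()):
            chars.append(html[i])
            offsets.append(i)
        cursor = tag.end()
    # Copy whatever follows the last tag.
    for i in range(cursor, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Translate a non-empty [start, end) span in the plain text
    into a [start, end) span in the original HTML."""
    return offsets[start], offsets[end - 1] + 1


if __name__ == "__main__":
    html = "<p>Hello <b>world</b></p>"
    plain, offsets = strip_tags_with_offsets(html)
    hit = plain.find("world")
    print(plain, to_original_span(offsets, hit, hit + len("world")))
```

A hit that spans a tag boundary maps to a span that includes the tag, so real highlighting code would have to split such spans before inserting markup. RTF and other formats would each need their own version of this mapping.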

DIGY




Re: Are there any analyzers for HTML or RTF files?

Posted by Hans Merkl <hm...@hmerkl.com>.
Hi DIGY,

How about if I want to highlight the search results? How does the
highlighter know the position within the formatted document if I have
converted it to text before indexing?

Hans




RE: Are there any analyzers for HTML or RTF files?

Posted by Digy <di...@gmail.com>.
No. But you can use the IFilter interface to convert any registered application's
format to text. For example, if you have MS Office installed, then you
already have word2text, excel2text, etc. converters.

See http://www.codeproject.com/KB/cs/IFilter.aspx

DIGY
