Posted to user@couchdb.apache.org by Hank Knight <hk...@gmail.com> on 2014/04/07 22:29:26 UTC

couchdb-lucene: ignore certain elements of HTML attachments

Using couchdb-lucene, is there a way to ignore all content inside a
blacklisted element of an HTML attachment?  The same common information
appears in the header of every HTML document, including links to
other pages, and ideally these common areas would not be searched.
For example:

<header>Hello</header>
<div id="header">Hello</div>

Re: couchdb-lucene: ignore certain elements of HTML attachments

Posted by Robert Samuel Newson <rn...@apache.org>.
Not at present, but if Tika has such an option it should be easy to expose.
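
In the meantime, one possible workaround (not a couchdb-lucene or Tika
feature) is to strip the blacklisted elements yourself before the text is
indexed. A minimal Python sketch using only the standard library; the names
BlacklistTextExtractor and searchable_text are made up for illustration, and
it assumes reasonably well-formed HTML where every non-void element is closed:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Collect text, skipping anything nested inside a blacklisted element."""

    def __init__(self, skip_tags=(), skip_ids=()):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_ids = set(skip_ids)
        self.skip_depth = 0  # > 0 while inside a blacklisted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already skipping: track nesting so we know when the region ends.
            self.skip_depth += 1
        elif tag in self.skip_tags or dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every blacklisted element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def searchable_text(html, skip_tags=("header",), skip_ids=("header",)):
    """Return the text of `html` with blacklisted elements removed."""
    parser = BlacklistTextExtractor(skip_tags, skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The result could then be stored in an ordinary document field and indexed
from there instead of from the raw attachment. Unclosed tags inside a
blacklisted element would throw off the depth counter, so this is only a
starting point.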

B.

On 7 Apr 2014, at 21:29, Hank Knight <hk...@gmail.com> wrote:

> Using couchdb-lucene is there a way to ignore all content inside a
> blacklisted element of HTML attachments?  Certain common information
> is found in the header of every HTML document, including links to
> other pages, and it would be ideal for these common areas not to be
> searched.
> 
> <header>Hello</header>
> <div id="header">Hello</div>