Posted to slide-dev@jakarta.apache.org by Unico Hommes <un...@hippo.nl> on 2004/10/30 15:30:42 UTC

LuceneContentIndexer

Hi all,

The TextContentIndexer currently uses two ways to add indexed fields to
the Lucene index. First, a resource's binary stream is indexed 'as is'
as a text field; second, each content extractor that applies to the
current resource is used to add another text field.

I think this is a confusing design. As it is now, there are two places
to include/exclude resources from indexing. Furthermore, the binary
stream of resources that are to be extracted, such as PDFs, is also
indexed, which doesn't make much sense.

Instead, I'd prefer to use only the extractor approach: add a
TextContentExtractor that simply echoes the contents as is, and add an
XMLContentExtractor that extracts XML character data to replace the
current XMLContentIndexer.
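
To make the idea concrete, here is a minimal sketch of what the two
extractors could look like. The SketchExtractor interface and all class
names below are hypothetical and only for illustration; the actual
extractor API in Slide may well differ, and the UTF-8 decoding in the
echo extractor is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical interface, not Slide's real extractor API.
interface SketchExtractor {
    Reader extract(InputStream content) throws Exception;
}

// Echoes the content as is; decoding as UTF-8 is an assumption.
class EchoTextExtractor implements SketchExtractor {
    public Reader extract(InputStream content) {
        return new InputStreamReader(content, StandardCharsets.UTF_8);
    }
}

// Keeps only XML character data, dropping tags and attributes.
class XmlCharDataExtractor implements SketchExtractor {
    public Reader extract(InputStream content) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(content, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return new StringReader(text.toString());
    }
}

class ExtractorSketch {
    // Drains a Reader into a String so the extracted text can be inspected.
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            out.append((char) c);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><title>Hello</title><body> world</body></doc>"
                .getBytes(StandardCharsets.UTF_8);
        // Character data only, versus the unmodified content.
        System.out.println(readAll(new XmlCharDataExtractor().extract(new ByteArrayInputStream(xml))));
        System.out.println(readAll(new EchoTextExtractor().extract(new ByteArrayInputStream(xml))));
    }
}
```

With this shape the indexer would only ever ask the applicable
extractors for text, so there is a single place to decide what gets
indexed for a given resource.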

Comments?

--
Unico


---------------------------------------------------------------------
To unsubscribe, e-mail: slide-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-dev-help@jakarta.apache.org


Re: LuceneContentIndexer

Posted by Unico Hommes <un...@hippo.nl>.
On 1-nov-04, at 12:19, Stefan Lützkendorf wrote:

> Unico Hommes wrote:
>
>> Instead I'd prefer only the extractor approach, add a 
>> TextContentExtractor that simply echoes the contents as is, and add 
>> an XMLContentExtractor that extracts XML character data to replace 
>> the current XMLContentIndexer.
> Sounds good to me.
>
> I have been thinking about creating a LuceneContentIndexer in the
> index.lucene package that merges the extractor support in
> TextContentIndexer with the support for transactions and asynchronous
> indexing in the lucene package.
>
> What do you think?
>

Yeah, by all means! Support for asynchronous indexing is a must. In my
experience, once the index starts to grow, adding and especially
optimizing documents becomes really slow.
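
As a rough sketch of the asynchronous part only: the caller hands a
resource URI to a background worker and returns immediately, so a slow
index update does not block the store operation. AsyncIndexerSketch and
its methods are hypothetical names, the "index" is just an in-memory
list standing in for the real Lucene update, and transaction handling is
left out entirely.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical class; a stand-in for an indexer that updates Lucene in the background.
class AsyncIndexerSketch {
    // One worker thread, so queued updates are applied in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stand-in for the real index: records which URIs have been processed.
    private final List<String> indexed = new CopyOnWriteArrayList<>();

    // Returns immediately; the (possibly slow) update runs on the worker thread.
    void indexLater(String uri) {
        worker.execute(() -> {
            indexed.add(uri);
        });
    }

    // Stops accepting work, waits for queued updates, and returns what was indexed.
    List<String> shutdownAndGetIndexed() throws InterruptedException {
        worker.shutdown();
        worker.awaitTermination(10, TimeUnit.SECONDS);
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncIndexerSketch indexer = new AsyncIndexerSketch();
        indexer.indexLater("/files/a.pdf");
        indexer.indexLater("/files/b.xml");
        System.out.println(indexer.shutdownAndGetIndexed());
    }
}
```

An expensive step such as optimizing the index could be queued on the
same worker instead of running inside the request that triggered it.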

--
Unico




Re: LuceneContentIndexer

Posted by Stefan Lützkendorf <lu...@apache.org>.
Unico Hommes wrote:

> 
> Instead I'd prefer only the extractor approach, add a 
> TextContentExtractor that simply echoes the contents as is, and add an 
> XMLContentExtractor that extracts XML character data to replace the 
> current XMLContentIndexer.
Sounds good to me.

I have been thinking about creating a LuceneContentIndexer in the
index.lucene package that merges the extractor support in
TextContentIndexer with the support for transactions and asynchronous
indexing in the lucene package.

What do you think?

Stefan

