Posted to slide-dev@jakarta.apache.org by Unico Hommes <un...@hippo.nl> on 2004/10/30 15:30:42 UTC
LuceneContentIndexer
Hi all,
The TextContentIndexer currently adds indexed fields to the Lucene
index in two ways. First, a resource's binary stream is indexed 'as is'
as a text field. Second, each content extractor that applies to the
current resource is used to add another text field.
I think this is a confusing design. As it is now, there are two places
to include/exclude resources from indexing. Furthermore, the binary
stream of resources that are to be extracted, such as PDFs, is also
indexed, which doesn't make much sense.
Instead, I'd prefer to keep only the extractor approach: add a
TextContentExtractor that simply echoes the content as is, and add an
XMLContentExtractor that extracts XML character data, replacing the
current XMLContentIndexer.
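
To make the idea concrete, here is a rough sketch of what the two
extractors could look like. The interface and class names below are
hypothetical (prefixed with "Sketch" for that reason) and are not
Slide's actual extractor API; UTF-8 is assumed for plain text content,
and modern Java syntax is used for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical interface: one extractor turns a resource stream into indexable text.
interface SketchContentExtractor {
    Reader extract(InputStream content) throws IOException;
}

// Echoes the content as is (assumption: the bytes are UTF-8 encoded text).
class SketchTextContentExtractor implements SketchContentExtractor {
    public Reader extract(InputStream content) {
        return new InputStreamReader(content, StandardCharsets.UTF_8);
    }
}

// Keeps only XML character data; element names and attributes are dropped.
class SketchXmlContentExtractor implements SketchContentExtractor {
    public Reader extract(InputStream content) throws IOException {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(content, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (ParserConfigurationException | SAXException e) {
            throw new IOException("Could not parse XML content", e);
        }
        return new StringReader(text.toString());
    }
}

// Small helper for trying the extractors out on in-memory strings.
class SketchExtractors {
    static String extractToString(SketchContentExtractor extractor, String input) {
        try (Reader reader = extractor.extract(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)))) {
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this, the indexer would only need to loop over the
extractors that apply to a resource and add one text field per
extractor, so include/exclude decisions live in a single place.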
Comments?
--
Unico
---------------------------------------------------------------------
To unsubscribe, e-mail: slide-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-dev-help@jakarta.apache.org
Re: LuceneContentIndexer
Posted by Unico Hommes <un...@hippo.nl>.
On 1-nov-04, at 12:19, Stefan Lützkendorf wrote:
> Unico Hommes wrote:
>
>> Instead I'd prefer only the extractor approach, add a
>> TextContentExtractor that simply echoes the contents as is, and add
>> an XMLContentExtractor that extracts XML character data to replace
>> the current XMLContentIndexer.
> Sounds good to me.
>
> I have been thinking about creating a LuceneContentIndexer in the
> index.lucene package that merges the extractor support of
> TextContentIndexer with the transaction and asynchronous indexing
> support in the lucene package.
>
> What do you think?
>
Yeah, by all means! Support for asynchronous indexing is a must. In my
experience, once the index starts to grow, adding documents and
especially optimizing the index become really slow.
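
As a rough illustration of why asynchronous indexing helps here: the
caller only queues the work and returns, while a single background
thread applies the index updates in order. The class below is a
hypothetical sketch, not code from Slide or Lucene, and the Consumer
stands in for whatever actually writes to the index.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: queues index updates and applies them on one worker thread.
class SketchAsyncIndexer {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Consumer<String> indexWriter; // stand-in for the real index update

    SketchAsyncIndexer(Consumer<String> indexWriter) {
        this.indexWriter = indexWriter;
    }

    // Returns immediately; the slow index update runs later on the worker thread.
    void submit(String uri) {
        worker.execute(() -> indexWriter.accept(uri));
    }

    // Stops accepting new work and waits for the queued updates to finish.
    boolean shutdownAndWait(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A single worker keeps updates in submission order, which also avoids
concurrent writers on the same index.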
--
Unico
---------------------------------------------------------------------
Re: LuceneContentIndexer
Posted by Stefan Lützkendorf <lu...@apache.org>.
Unico Hommes wrote:
>
> Instead I'd prefer only the extractor approach, add a
> TextContentExtractor that simply echoes the contents as is, and add an
> XMLContentExtractor that extracts XML character data to replace the
> current XMLContentIndexer.
Sounds good to me.
I have been thinking about creating a LuceneContentIndexer in the
index.lucene package that merges the extractor support of
TextContentIndexer with the transaction and asynchronous indexing
support in the lucene package.
What do you think?
Stefan