
Posted to solr-user@lucene.apache.org by Martin Líška <dj...@gmail.com> on 2011/06/20 14:40:58 UTC

How to add an unextracted field when using Solr Cell

Hello,

I would like to migrate my existing Lucene application to Solr, but I'm
struggling with one thing (the most important one, though).
I would like to index XHTML files using ExtractingRequestHandler, which is no
problem. However, I have a custom Tokenizer that expects well-formed XML
(preferably the whole XHTML document) and produces certain tokens with
payloads for Lucene. I've added this tokenizer to Solr as a plugin and added
the required schema.xml entries (a custom field type that uses this Tokenizer
and a field that uses this type), and everything works fine in the Solr admin
analysis page.
I am having a hard time going through the Solr Cell API and sources to find
out how to incorporate the creation of such a custom field. What I would like
to do, I guess, is to recognize the input document type (this is already done
somewhere) and, when it is an XHTML file, add a custom field to the
SolrDocument that uses a certain schema.xml field definition and feed it the
whole InputStream of the input document.
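
To make the intent concrete, here is a rough sketch in plain Java of the logic
I have in mind. It deliberately does not use any Solr Cell classes, since the
right hook there is exactly what I am asking about. The class name, the field
name "xhtml_raw", the use of a plain Map in place of the Solr document, and the
UTF-8 assumption are all placeholders of mine for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Illustration only: a Map stands in for the Solr document.
final class RawXhtmlFieldSketch {

    // Hypothetical field name; it would have to match a field in schema.xml
    // whose type uses the custom Tokenizer.
    static final String RAW_FIELD = "xhtml_raw";

    // True if the detected content type is XHTML, ignoring parameters
    // such as "; charset=UTF-8".
    static boolean isXhtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.equals("application/xhtml+xml");
    }

    // Reads the whole stream into a String (assumes UTF-8 input).
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Adds the unextracted document text as an extra field, but only for XHTML.
    static Map<String, Object> addRawField(Map<String, Object> doc,
                                           String contentType,
                                           InputStream in) {
        if (isXhtml(contentType)) {
            doc.put(RAW_FIELD, readAll(in));
        }
        return doc;
    }
}
```

In the real setup the field value would then be analyzed by my Tokenizer
through the schema.xml field type, and the other fields would still come from
the normal extraction.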
I hope this is clear enough.
Can somebody point me in the right direction on how to achieve this?

Thank you,

Martin