Posted to dev@lucene.apache.org by "Alexandre Rafalovitch (JIRA)" <ji...@apache.org> on 2013/03/05 13:19:12 UTC

[jira] [Created] (SOLR-4530) DIH: Provide configuration to use Tika's IdentityHtmlMapper

Alexandre Rafalovitch created SOLR-4530:
-------------------------------------------

             Summary: DIH: Provide configuration to use Tika's IdentityHtmlMapper
                 Key: SOLR-4530
                 URL: https://issues.apache.org/jira/browse/SOLR-4530
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 4.1
            Reporter: Alexandre Rafalovitch
            Priority: Minor
             Fix For: 4.2


When using TikaEntityProcessor in DIH, the default HTML mapper strips out most of the HTML markup. That makes sense when the goal is just to store the extracted content as a text blob, but DIH allows more fine-grained content extraction (e.g. with a nested XPathEntityProcessor), which needs the markup to be preserved.

Recent Tika versions allow setting an alternative HTML mapper implementation (IdentityHtmlMapper) that passes all the HTML through. It would be useful to be able to select that implementation from the DIH configuration.
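
As a rough illustration of what such a setting might look like, here is a sketch of a DIH data-config fragment. The htmlMapper attribute and its "identity" value are hypothetical names chosen for this example and are not something this issue has defined; the file path, entity names and field names are placeholders as well. The remaining attributes (processor, dataSource, url, format) are ones TikaEntityProcessor already accepts.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- htmlMapper="identity" is a HYPOTHETICAL attribute: it stands for
         "use Tika's IdentityHtmlMapper instead of the default mapper".
         The url value is a placeholder path. -->
    <entity name="page" processor="TikaEntityProcessor" dataSource="bin"
            url="/path/to/page.html" format="xml" htmlMapper="identity">
      <!-- Placeholder field mapping: the extracted markup goes to "content". -->
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```

With the markup preserved in this way, a nested XPathEntityProcessor could then pick out individual elements instead of working on a flattened text blob.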
