Posted to dev@lucene.apache.org by "Alexandre Rafalovitch (JIRA)" <ji...@apache.org> on 2013/03/21 03:51:16 UTC

[jira] [Updated] (SOLR-4530) DIH: Provide configuration to use Tika's IdentityHtmlMapper

     [ https://issues.apache.org/jira/browse/SOLR-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Rafalovitch updated SOLR-4530:
----------------------------------------

    Attachment: SOLR-4530.patch

Patch against trunk.
                
> DIH: Provide configuration to use Tika's IdentityHtmlMapper
> -----------------------------------------------------------
>
>                 Key: SOLR-4530
>                 URL: https://issues.apache.org/jira/browse/SOLR-4530
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 4.1
>            Reporter: Alexandre Rafalovitch
>            Priority: Minor
>             Fix For: 4.3
>
>         Attachments: SOLR-4530.patch
>
>
> When using TikaEntityProcessor in DIH, the default HTML mapper strips out most of the HTML. That makes sense when the goal is just to store the extracted content as a text blob, but DIH allows more fine-grained content extraction (e.g. with a nested XPathEntityProcessor).
> Recent Tika versions allow setting an alternative HTML mapper implementation that passes all of the HTML through. It would be useful to be able to select that implementation from the DIH configuration.
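
To illustrate the difference the description refers to: in Tika itself, the mapper is chosen by registering an HtmlMapper implementation (such as IdentityHtmlMapper) in the ParseContext handed to the parser. The sketch below does not use Tika; it is a standard-library-only stand-in showing why a whitelist-style mapper loses markup that a nested XPathEntityProcessor might need, while an identity-style mapper keeps it. The names HtmlMapperSketch, TagMapper, DEFAULT_LIKE, IDENTITY_LIKE, and filter are hypothetical, and the whitelist is only a rough approximation, not Tika's actual list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlMapperSketch {

    /** Hypothetical stand-in for a mapper: returns the output tag name, or null to drop the tag. */
    public interface TagMapper {
        String mapSafeElement(String name);
    }

    // Assumed whitelist for illustration only; Tika's real default mapper has its own rules.
    private static final Set<String> SAFE =
            new HashSet<>(Arrays.asList("p", "a", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td"));

    /** Whitelist-style mapper: tags outside the whitelist are dropped, their text is kept. */
    public static final TagMapper DEFAULT_LIKE = name -> {
        String lower = name.toLowerCase(Locale.ROOT);
        return SAFE.contains(lower) ? lower : null;
    };

    /** Identity-style mapper: every tag is passed through (lower-cased). */
    public static final TagMapper IDENTITY_LIKE = name -> name.toLowerCase(Locale.ROOT);

    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");

    /** Rewrites each tag in the input according to the mapper; text between tags is untouched. */
    public static String filter(String html, TagMapper mapper) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String mapped = mapper.mapSafeElement(m.group(2));
            String replacement = (mapped == null)
                    ? ""
                    : "<" + m.group(1) + mapped + m.group(3) + ">";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"price\"><span>42</span></div><p>text</p>";
        System.out.println(filter(html, DEFAULT_LIKE));   // div and span are gone
        System.out.println(filter(html, IDENTITY_LIKE));  // markup is preserved
    }
}
```

With the whitelist-style mapper, the div with its class attribute disappears, so a later XPath such as //div[@class='price'] would have nothing to match; the identity-style mapper leaves it in place.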

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org