Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2019/10/03 08:25:00 UTC
[jira] [Updated] (JAMES-2910) HTML could be indexed directly in ElasticSearch
[ https://issues.apache.org/jira/browse/JAMES-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier updated JAMES-2910:
----------------------------------
Affects Version/s: 3.4.0
> HTML could be indexed directly in ElasticSearch
> -----------------------------------------------
>
> Key: JAMES-2910
> URL: https://issues.apache.org/jira/browse/JAMES-2910
> Project: James Server
> Issue Type: Improvement
> Components: elasticsearch, guice
> Affects Versions: 3.4.0
> Reporter: Benoit Tellier
> Priority: Major
>
> When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.
> As a result, raw HTML markup ends up in the index: search precision decreases (the index is polluted by HTML) and the index becomes much larger than necessary.
> Proposal:
> CassandraGuice should default to JsoupTextExtractor when Tika is disabled.
> This way, HTML text extraction actually takes place before indexing.
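
To illustrate the difference described above, here is a minimal sketch, not James code. James is written in Java and the proposal names JsoupTextExtractor. The Python below uses only the standard library `html.parser` as a stand-in to show how stripping markup before indexing reduces the number of tokens. The names `VisibleTextParser`, `extract_text` and `html_body` are hypothetical, and the whitespace tokenization is a simplification of what ElasticSearch analyzers actually do.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a reader of the mail would actually see.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse whitespace runs.
        # Simplification: a word split by inline tags would be separated.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML body (assumed, simplified behaviour)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


html_body = (
    '<html><head><style>p { color: red; }</style></head>'
    '<body><p class="greeting">Hello <b>James</b> users</p></body></html>'
)

# Naive whitespace tokenization, standing in for an index analyzer.
raw_tokens = html_body.split()
clean_tokens = extract_text(html_body).split()

print(len(raw_tokens), "tokens when the raw HTML is indexed")
print(len(clean_tokens), "tokens after text extraction:", clean_tokens)
```

In this sketch, tag names, attributes and CSS never reach the token list once extraction is applied, which is the effect the proposal is after for both precision and index size.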
--
This message was sent by Atlassian Jira
(v8.3.4#803005)