Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2019/10/15 03:23:00 UTC
[jira] [Closed] (JAMES-2910) HTML could be indexed directly in ElasticSearch
[ https://issues.apache.org/jira/browse/JAMES-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier closed JAMES-2910.
---------------------------------
Resolution: Fixed
https://github.com/linagora/james-project/pull/2739 solved this
> HTML could be indexed directly in ElasticSearch
> -----------------------------------------------
>
> Key: JAMES-2910
> URL: https://issues.apache.org/jira/browse/JAMES-2910
> Project: James Server
> Issue Type: Improvement
> Components: elasticsearch, guice
> Affects Versions: 3.4.0
> Reporter: Benoit Tellier
> Priority: Major
> Fix For: 3.5.0
>
>
> When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.
> In that situation search precision decreases (the index is polluted by HTML markup), and the index also grows much larger than necessary.
> Proposal:
> CassandraGuice should default to JsoupTextExtractor when Tika is disabled.
> This will allow HTML text extraction to actually happen.
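
To illustrate the index pollution described above, here is a minimal, dependency-free sketch in Java. It is not the James implementation and does not use jsoup: the class HtmlIndexingSketch and its methods extractText and tokens are hypothetical names, and the regex-based tag stripping is only a simplified stand-in for a real HTML parser. It compares the terms an index would receive from raw HTML with the terms it would receive after text extraction.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Apache James.
class HtmlIndexingSketch {
    // Removes <script> and <style> elements together with their bodies.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag. A real HTML parser handles far more cases.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Splits on anything that is not a letter or a digit.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Simplified stand-in for HTML text extraction.
    // Assumption: entities such as &amp; are NOT decoded here.
    public static String extractText(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    // Naive tokenizer approximating the distinct terms a search index would store.
    public static Set<String> tokens(String content) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : NON_WORD.split(content.toLowerCase(Locale.ROOT))) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><div class=\"header\">"
            + "<p>Quarterly <b>report</b> attached</p></div></body></html>";

        // Terms indexed when the raw HTML is passed through unchanged.
        System.out.println("raw:       " + tokens(html));
        // Terms indexed after text extraction.
        System.out.println("extracted: " + tokens(extractText(html)));
    }
}
```

In this sketch the raw variant contributes markup terms such as "div" and "class" to the index, so a search for those words would match the message even though they never appear in its visible text. The extracted variant keeps only the words of the message body.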
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org