Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2019/10/03 08:24:00 UTC
[jira] [Created] (JAMES-2910) HTML could be indexed directly in ElasticSearch
Benoit Tellier created JAMES-2910:
-------------------------------------
Summary: HTML could be indexed directly in ElasticSearch
Key: JAMES-2910
URL: https://issues.apache.org/jira/browse/JAMES-2910
Project: James Server
Issue Type: Improvement
Components: elasticsearch, guice
Reporter: Benoit Tellier
When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.
In that situation the index is polluted by HTML markup, which decreases search precision and significantly increases the index size.
Proposal:
CassandraGuice should default to JsoupTextExtractor when Tika is disabled, so that HTML text extraction actually happens.
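
To illustrate the proposal, here is a minimal, self-contained sketch of the selection logic. It is not the James code: the TextExtractor interface is a simplified, hypothetical stand-in, the class and method names are invented for the example, and the regex-based tag stripping only imitates what a real HTML parser such as jsoup would do much more robustly.

```java
import java.util.regex.Pattern;

public class TextExtractorSelection {

    /** Hypothetical, simplified extractor contract (assumption, not the James interface). */
    interface TextExtractor {
        String extract(String content, String contentType);
    }

    /** Returns the content unchanged, so HTML markup would reach the index as-is. */
    static final class PassThroughExtractor implements TextExtractor {
        @Override
        public String extract(String content, String contentType) {
            return content;
        }
    }

    /**
     * Naive stand-in for an HTML-aware extractor. It strips tags with a regex,
     * which is NOT a real HTML parser; a jsoup-based extractor would replace it.
     */
    static final class HtmlStrippingExtractor implements TextExtractor {
        private static final Pattern TAGS = Pattern.compile("<[^>]*>");
        private static final Pattern WHITESPACE = Pattern.compile("\\s+");

        @Override
        public String extract(String content, String contentType) {
            if (!contentType.startsWith("text/html")) {
                return content; // only HTML bodies are rewritten
            }
            String withoutTags = TAGS.matcher(content).replaceAll(" ");
            return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
        }
    }

    /** Picks the Tika-backed extractor when enabled, else the HTML-aware fallback. */
    static TextExtractor select(boolean tikaEnabled, TextExtractor tikaBacked) {
        return tikaEnabled ? tikaBacked : new HtmlStrippingExtractor();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        TextExtractor current = new PassThroughExtractor();
        TextExtractor proposed = select(false, new PassThroughExtractor());

        // Compare what each extractor would hand over for indexing.
        System.out.println("pass-through: " + current.extract(html, "text/html"));
        System.out.println("fallback:     " + proposed.extract(html, "text/html"));
    }
}
```

The point of the sketch is only the fallback choice: with Tika disabled, an HTML-aware extractor is selected instead of one that forwards raw markup to the index.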