Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2019/10/03 08:24:00 UTC

[jira] [Created] (JAMES-2910) HTML could be indexed directly in ElasticSearch

Benoit Tellier created JAMES-2910:
-------------------------------------

             Summary: HTML could be indexed directly in ElasticSearch
                 Key: JAMES-2910
                 URL: https://issues.apache.org/jira/browse/JAMES-2910
             Project: James Server
          Issue Type: Improvement
          Components: elasticsearch, guice
            Reporter: Benoit Tellier


When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.

In that situation, search precision decreases because the index is polluted by HTML markup, and the index grows much larger than necessary.

Proposal:

CassandraGuice should default to JsoupTextExtractor when Tika is disabled.

This will allow HTML text extraction to actually happen.
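
To illustrate the difference this makes to the index, here is a minimal, self-contained sketch. It does not use the James classes: the TextExtractor interface, the PASS_THROUGH and HTML_AWARE extractors, the chooseExtractor method, and the tokens helper are all hypothetical names introduced for this example. The tag stripping is a naive regex standing in for a real HTML parser such as jsoup, and the tokenizer only approximates what a search analyzer would do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExtractorSelectionSketch {

    /** Hypothetical extractor contract for this sketch; not the James interface. */
    public interface TextExtractor {
        String extract(String content, String contentType);
    }

    /** Mimics the behaviour described in the issue: content is indexed as-is, markup included. */
    public static final TextExtractor PASS_THROUGH = (content, contentType) -> content;

    /**
     * Naive stand-in for an HTML-aware extractor (assumption: a real one
     * would use an HTML parser such as jsoup rather than a regex).
     */
    public static final TextExtractor HTML_AWARE = (content, contentType) ->
        contentType.startsWith("text/html")
            ? content.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim()
            : content;

    /** Sketch of the proposal: fall back to the HTML-aware extractor when Tika is disabled. */
    public static TextExtractor chooseExtractor(boolean tikaEnabled, TextExtractor tikaExtractor) {
        return tikaEnabled ? tikaExtractor : HTML_AWARE;
    }

    /** Rough approximation of the terms an analyzer would put in the index. */
    public static List<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String html = "<html><body><p class=\"intro\">Quarterly report</p></body></html>";

        // Without HTML extraction, tag and attribute names end up as index terms.
        System.out.println(tokens(PASS_THROUGH.extract(html, "text/html")));

        // With the proposed fallback, only the visible text is indexed.
        TextExtractor fallback = chooseExtractor(false, PASS_THROUGH);
        System.out.println(tokens(fallback.extract(html, "text/html")));
    }
}
```

Running the sketch prints [html, body, p, class, intro, quarterly, report, p, body, html] for the pass-through case and [quarterly, report] for the HTML-aware case. A search for "body" or "class" would match the first message even though neither word appears in its visible text, which is the loss of precision described above.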



--
This message was sent by Atlassian Jira
(v8.3.4#803005)