Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2019/10/03 08:25:00 UTC

[jira] [Updated] (JAMES-2910) HTML could be indexed directly in ElasticSearch

     [ https://issues.apache.org/jira/browse/JAMES-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Tellier updated JAMES-2910:
----------------------------------
    Affects Version/s: 3.4.0

> HTML could be indexed directly in ElasticSearch
> -----------------------------------------------
>
>                 Key: JAMES-2910
>                 URL: https://issues.apache.org/jira/browse/JAMES-2910
>             Project: James Server
>          Issue Type: Improvement
>          Components: elasticsearch, guice
>    Affects Versions: 3.4.0
>            Reporter: Benoit Tellier
>            Priority: Major
>
> When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.
> As a result, raw HTML markup pollutes the index, which decreases search precision and greatly increases the index size.
> Proposal:
> CassandraGuice should default to JsoupTextExtractor when Tika is disabled.
> This will allow HTML text extraction to actually happen.
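
The proposed fallback could be sketched roughly as follows. This is only an illustration, not the James code: the TextExtractor interface, the PASS_THROUGH and HTML_STRIPPING constants, and the select method are hypothetical names, and the regex-based tag stripping is a naive stand-in for what a real HTML parser such as Jsoup does (it does not decode entities or cope with malformed markup).

```java
import java.util.Locale;

final class TextExtractorSelection {

    /** Hypothetical extractor contract for this sketch; not the James interface. */
    interface TextExtractor {
        String extract(String content, String contentType);
    }

    /** Returns the content unchanged, so HTML markup would reach the index as-is. */
    static final TextExtractor PASS_THROUGH = (content, contentType) -> content;

    /** Naive HTML-to-text conversion; a stand-in for a real parser such as Jsoup. */
    static final TextExtractor HTML_STRIPPING = (content, contentType) -> {
        if (contentType == null
                || !contentType.toLowerCase(Locale.ROOT).startsWith("text/html")) {
            return content; // leave non-HTML content untouched
        }
        return content
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop script/style bodies
            .replaceAll("(?s)<[^>]*>", " ")                         // drop remaining tags
            .replaceAll("\\s+", " ")                                // collapse whitespace
            .trim();
    };

    /** Uses the Tika-backed extractor when enabled, else falls back to HTML stripping. */
    static TextExtractor select(boolean tikaEnabled, TextExtractor tikaExtractor) {
        return tikaEnabled ? tikaExtractor : HTML_STRIPPING;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        // Tika disabled: the second argument is ignored and the fallback is used.
        TextExtractor extractor = select(false, PASS_THROUGH);
        System.out.println(extractor.extract(html, "text/html")); // prints: Hello world
    }
}
```

With the fallback in place, only the visible text would be sent for indexing, which is the effect the proposal is after when Tika is not available.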



--
This message was sent by Atlassian Jira
(v8.3.4#803005)