Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2019/10/15 03:23:00 UTC

[jira] [Closed] (JAMES-2910) HTML could be indexed directly in ElasticSearch

     [ https://issues.apache.org/jira/browse/JAMES-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Tellier closed JAMES-2910.
---------------------------------
    Resolution: Fixed

Resolved by https://github.com/linagora/james-project/pull/2739

> HTML could be indexed directly in ElasticSearch
> -----------------------------------------------
>
>                 Key: JAMES-2910
>                 URL: https://issues.apache.org/jira/browse/JAMES-2910
>             Project: James Server
>          Issue Type: Improvement
>          Components: elasticsearch, guice
>    Affects Versions: 3.4.0
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.5.0
>
>
> When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.
> This reduces search precision in that situation (the index is polluted by HTML markup) and also results in a much larger index.
> Proposal:
> CassandraGuice should default to JsoupTextExtractor when Tika is disabled.
> This will allow HTML text extraction to actually happen.
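
For illustration only, the proposed fallback could be sketched as below. This is not the code from the pull request. Every name here (TextExtractorSelection, PASS_THROUGH, HTML_AWARE, select) is hypothetical, and the regex-based tag stripping is only a crude stand-in for a real HTML parser such as jsoup.

```java
// Hypothetical sketch: none of these types are the actual James classes.
final class TextExtractorSelection {

    /** Minimal stand-in for a text extractor: turns a message body into indexable text. */
    interface TextExtractor {
        String extract(String body, String contentType);
    }

    /** Mimics the behaviour described in the ticket: the body is indexed unchanged, tags included. */
    static final TextExtractor PASS_THROUGH = (body, contentType) -> body;

    /**
     * Crude stand-in for an HTML-aware extractor. Assumption: a real implementation
     * would delegate to an HTML parser rather than use regular expressions.
     */
    static final TextExtractor HTML_AWARE = (body, contentType) -> {
        if (contentType == null || !contentType.startsWith("text/html")) {
            return body; // non-HTML content is left untouched
        }
        return body.replaceAll("<[^>]*>", " ") // drop tags
                   .replaceAll("\\s+", " ")    // collapse whitespace left behind
                   .trim();
    };

    /** Use the Tika-backed extractor when Tika is enabled, otherwise fall back to the HTML-aware one. */
    static TextExtractor select(boolean tikaEnabled, TextExtractor tikaExtractor) {
        return tikaEnabled ? tikaExtractor : HTML_AWARE;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        // With Tika disabled, the fallback strips the markup before indexing.
        System.out.println(select(false, PASS_THROUGH).extract(html, "text/html"));
    }
}
```

With the pass-through extractor the whole markup string would reach the index, whereas the fallback only hands over the visible text, which is the effect the proposal is after.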



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
