Posted to server-dev@james.apache.org by Raphaël Ouazana-Sustowski <ro...@apache.org> on 2020/06/11 16:01:54 UTC

Distributed James: make ElasticSearch indexing optional?

Hi,

Here is a proposal to make ElasticSearch optional in our distributed
product/flavor/server.

Comments are welcome.

## Why?

Some people have expressed the need to run a distributed James without
ElasticSearch:
- in a comment here: https://issues.apache.org/jira/browse/JAMES-3086
- one of our customers plans to deploy a distributed James server for
serving POP3 encrypted emails. This deployment does not rely on
search features. However, with the current Distributed James server,
the customer is forced to rely on ElasticSearch email indexing.

This wastes resources, as an ElasticSearch cluster sized to keep up with
the volume is expensive. Maintaining one when it is not needed is costly
at several levels:
- cost of the infrastructure to deploy it
- cost of the people who have to maintain it
- performance cost on James, which indexes data unnecessarily

## How?

Scanning search is a search implementation that runs on top of any
mailbox implementation, even distributed ones, and does not require
indexing data.
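
To illustrate the idea, here is a minimal sketch in plain Java. It is not
the James implementation: the `Message` class, the `scan` method and the
subject-only matching are hypothetical simplifications. Every message of
the mailbox is read and tested against the criterion, so nothing is
indexed up front, but each search costs time proportional to the mailbox
size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ScanningSearchSketch {
    /** Hypothetical, simplified message: a uid and a subject header. */
    static final class Message {
        final long uid;
        final String subject;

        Message(long uid, String subject) {
            this.uid = uid;
            this.subject = subject;
        }
    }

    /**
     * Reads every message of the mailbox and keeps the uids whose subject
     * contains the given text (case-insensitive). No index is consulted.
     */
    static List<Long> scan(List<Message> mailbox, String subjectContains) {
        String needle = subjectContains.toLowerCase(Locale.ROOT);
        List<Long> matches = new ArrayList<>();
        for (Message message : mailbox) {
            if (message.subject.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(message.uid);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Message> mailbox = List.of(
            new Message(1, "Invoice June"),
            new Message(2, "Meeting notes"),
            new Message(3, "RE: invoice correction"));
        System.out.println(scan(mailbox, "invoice")); // prints [1, 3]
    }
}
```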

Scanning search is tested at the component level (unit tests) and also
passes the IMAP (MPT) tests on top of the Cassandra implementation, as
well as the JMAP memory tests, so it delivers correct results. Of course,
it does not support full-text search.

We should allow Distributed James to optionally rely on scanning search
instead of ElasticSearch.

- Scanning search should be advised for deployments rarely searching data
- ElasticSearch should be advised when search is frequent or requires
high performance

We could use module choosing [1] to choose between scanning search and
ElasticSearch.
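
As a rough sketch of what module choosing could look like (plain Java,
no Guice dependency): the option values, the module names and the
ElasticSearch default below are all assumptions made for illustration,
and strings stand in for real Guice modules. The decision is taken once,
when the server is assembled, so that each module can stay free of
conditionals.

```java
import java.util.List;
import java.util.Locale;

public class SearchModuleChooser {
    /** Hypothetical values of a hypothetical configuration option. */
    enum SearchImplementation { ELASTICSEARCH, SCANNING }

    /** Parses the configured value; defaulting to ElasticSearch is an assumption. */
    static SearchImplementation parse(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return SearchImplementation.ELASTICSEARCH;
        }
        return SearchImplementation.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }

    /**
     * Returns the names of the modules to install. Strings stand in for
     * Guice modules here; the module names are placeholders.
     */
    static List<String> chooseModules(SearchImplementation implementation) {
        switch (implementation) {
            case SCANNING:
                return List.of("ScanningSearchModule");
            case ELASTICSEARCH:
            default:
                return List.of("ElasticSearchClientModule", "ElasticSearchMailboxIndexModule");
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseModules(parse("scanning")));
        System.out.println(chooseModules(parse(null)));
    }
}
```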

Note that scanning search introduces no other dependencies, as it is
part of mailbox-store, and thus causes no risk of library clashes.

Note also that metric collection and log collection using ElasticSearch
are unaffected.

## Alternative

The alternative would be to build a different product/flavor/server than
the distributed one, where the only difference with the distributed one
is that indexing will rely on scanning instead of ElasticSearch.

The maintenance cost of such a product/flavor/server is higher than that
of a configuration option (Docker images to release, time and energy to
run integration tests on it).

Such a product/flavor is hard to brand: even if it answers a need, it is
not far from the distributed one, and it does not answer needs that are
very far from it either.

The advantage is that it would allow us to fine-tune this solution to
answer the exact needs.

## Work in Progress

See pull request: https://github.com/linagora/james-project/pull/3425

Regards,

Raphaël.

[1]
https://github.com/apache/james-project/blob/master/src/adr/0036-against-use-of-conditional-statements-in-guice-modules.md

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: Distributed James: make ElasticSearch indexing optional?

Posted by Eugen Stan <eu...@netdava.com>.

+1

Making complexity OPT-IN is a nice touch.

On 12.06.2020 05:06, Tellier Benoit wrote:
> +1
-- 
Eugen Stan
+40720 898 747 / netdava.com

Re: Distributed James: make ElasticSearch indexing optional?

Posted by Tellier Benoit <bt...@apache.org>.

+1


Re: Distributed James: make ElasticSearch indexing optional?

Posted by Eugen Stan <eu...@netdava.com>.

Hi,

I've looked at this thread again and I see strong arguments over a
technical issue.

The proposal in question is whether we should add a configuration switch
to OPT-IN/OUT of a large, useful dependency.

This is a technical decision that has implications mostly for users and
less for developers - since the code seems simple to follow.

@Matthieu: To me it appears like this:

- It means that users can do it via a configuration switch, if they
choose to.

- It does not mean it should be disabled by default (that's for another
discussion)

- The configuration code looks easy to maintain

Let's leave what users will want or not out of the discussion since we
can't speak for them anyway.

We can only say what we would do as users or what the users we worked
with already did.

Looking at the code size and changes for this feature and the benefits
of having the option to disable search (faster initial setup, solution
for niche use cases), I think the PR can be merged.

On 23.06.2020 11:38, Matthieu Baechler wrote:
> Hi,
>
> On Wed, 2020-06-17 at 17:53 +0200, Raphaël Ouazana-Sustowski wrote:
>> Hi Matthieu,
>>
>> I don't see much new arguments in your last answer.
> It's weird because I think I didn't really repeat what I said
> previously.
>
>>  I can answer your 
>> questions one by one, but I would like to go forward.
> It's your freedom of course, it's a bit sad that such a debate don't
> end in a consensus but it happens.
>
>> Would it make a consensus for you if we work on merging the current
>> PR, 
>> with always the option to revert it and go to a product if needed?
> No, that's not what I call consensus as it doesn't take into account my
> opinion.
>
>> Or do you prefer that we ask for a vote?
> Feel free to do it or not. I won't stand in your way.
>
> Cheers,
>
> -- Matthieu Baechler
-- 
Eugen Stan
+40720 898 747 / netdava.com

Re: Distributed James: make ElasticSearch indexing optional?

Posted by Matthieu Baechler <ma...@apache.org>.

Hi,

On Wed, 2020-06-17 at 17:53 +0200, Raphaël Ouazana-Sustowski wrote:
> Hi Matthieu,
> 
> I don't see much new arguments in your last answer.

It's weird because I think I didn't really repeat what I said
previously.

>  I can answer your 
> questions one by one, but I would like to go forward.

It's your freedom of course. It's a bit sad that such a debate doesn't
end in a consensus, but it happens.

> Would it make a consensus for you if we work on merging the current
> PR, 
> with always the option to revert it and go to a product if needed?

No, that's not what I call consensus as it doesn't take into account my
opinion.

> Or do you prefer that we ask for a vote?

Feel free to do it or not. I won't stand in your way.

Cheers,

-- Matthieu Baechler


Re: Distributed James: make ElasticSearch indexing optional?

Posted by Raphaël Ouazana-Sustowski <ro...@apache.org>.

Hi Matthieu,

I don't see many new arguments in your last answer. I can answer your
questions one by one, but I would like to move forward.

Would it be a consensus for you if we worked on merging the current PR,
always keeping the option to revert it and go for a product if needed?

Or do you prefer that we ask for a vote?

Cheers,

Raphaël.

On 15/06/2020 at 16:15, Matthieu Baechler wrote:
> On Mon, 2020-06-15 at 15:30 +0200, Raphaël Ouazana-Sustowski wrote:
> [...]

Re: Distributed James: make ElasticSearch indexing optional?

Posted by Matthieu Baechler <ma...@apache.org>.

On Mon, 2020-06-15 at 15:30 +0200, Raphaël Ouazana-Sustowski wrote:
> 

[...]

> > > I see many use cases where you would not need search, essentially
> > > based on automatic mail processing, which is a common James
> > > workflow.
> > Does it still make sense to support IMAP at this point? I'm almost
> > sure people would expect REST and/or MQ in this case, don't you
> > think?
> 
> Standard vs non-standard API? So yes, it can make sense. I won't go
> further on this topic because, as you said, I don't know exactly the
> need for such a workflow, so if people are interested, please
> contribute to this discussion.

I already talked to some people willing to use that very feature. And
regarding the IMAP protocol (without search) vs a non-standard REST API,
given that I wrote a server acting as an IMAP client in the past, I
would choose the REST API by far.

[...]

> > > > Does disabling only ES in that context make sense at all for
> > > > the Distributed James *product*?
> > > > 
> > > > Shouldn't we craft a specific Distributed SMTP+POP product
> > > > instead that would remove all waste?
> > > It makes sense because it allows to easily go back from one
> > > configuration to the other. Going back and forth between scanning
> > > implementation and ES one is pretty easy.
> > As long as you don't have real users with mails. How long will a
> > full-reindex (that is supposed to be slow according to user
> > complaints) take with some Terabytes of emails? Is it what you call
> > "easy"? Because having a Distributed Mail Server without a huge
> > amount of data doesn't make much sense.
> 
> It depends, the Distributed Mail Server currently covers the use case
> of high availability. So it can make sense outside of the big data
> world.

Oh, really? You are saying that somebody would deploy a 3-node
Cassandra cluster, a 2-node RabbitMQ cluster and an object storage
service for high availability without having at least a TiB of data?

Given that SMTP has the backup MX feature built in, which allows dealing
with service downtime?

I don't think this exists. Do you have evidence that people are willing
to do that?

> 
> > So, let's be realistic: this switch, while possible with some
> > configuration, would be quite hard to handle properly in the real
> > world (it requires at least some ops and active monitoring).
> > 
> > > Having a new (potentially optimized) product could be great in
> > > some cases, but would totally go against this.
> > Can we have arguments?
> > 
> > Bundling too many use cases in a single product is not very
> > appealing to me because I suspect it will become too complex by
> > doing too many different things, confusing to users because we'll
> > have to explain carefully in which case a specific option makes
> > sense, hard to maintain, because it's hard to make good choices
> > when we can't figure out who our users are, etc.
> 
> What's the difference between explaining a configuration option and
> explaining which product to choose?

In one case you are use-case based, which is easy to reason about.

In the other case you can combine things for whatever reason you want,
and thus it's harder to document (and to make choices as a dev team)
because you don't know how it's used.

> From my point of view, one product is comfortable.

As a user or as a developer of the James product?

> You know you have some configuration options that can give you such
> or such options. Several products force you to make the right choices
> at the very beginning of the project, when you don't know exactly
> your requirements to make the right choices.

You don't know at the beginning that you want to use your server without
real users doing search? I really doubt it.

[...]

> 
> Finally the configuration option is already the object of a pull
> request, and it seems to be much simpler than having a new product
> (in terms of quantity of code and impact of the deployment -- of
> course simplicity is very subjective: for example Guice is simple for
> you, it can be different for other people).

Yes, of course, it's less effort to add an option than to create a
dedicated product.

> If in the future a new product makes more sense, reverting this PR
> and building a product around this would not be too much of a burden.
> The other way around, going from 2 (potentially incompatible)
> products to only one would be way harder in terms of the data
> migration we would have to implement.
> 
> That's also why I think the good choice for now is to add a
> configuration option.

What would be different? In the first case you have to support the
feature for at least one release after deprecation. You would have to
document the migration too: are users that rely on Noop search willing
to deploy an ES cluster just because you want to remove the support?

In the second case, you'll have to migrate to a supported product: same
deprecation process, same migration doc, etc.

I don't see how this is different from a revert.

What is sure: integrating the option in a release will cost a lot in
the future, so we may want to think about it carefully.

-- Matthieu Baechler


Re: Distributed James: make ElasticSearch indexing optional?

Posted by Raphaël Ouazana-Sustowski <ro...@apache.org>.

Hello,

On 15/06/2020 at 09:52, Matthieu Baechler wrote:
> Hi Raphael,
>
> On Fri, 2020-06-12 at 18:29 +0200, Raphaël Ouazana-Sustowski wrote:
>> Hello Matthieu,
>>
>> On 12/06/2020 at 10:05, Matthieu Baechler wrote:
>>> Hi Raphael,
>>>
>>> My answers below
>>>
>>> On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
>>>> Hi,
>>>>
>>>> Here is a proposal to make ElasticSearch optional in our
>>>> distributed
>>>> product/flavor/server.
>>>>
>>>> Comments are welcome.
>>>>
>>>>
>>>> ## Why?
>>>>
>>>> Some people have expressed the need of using a distributed James
>>>> without
>>>> ElasticSearch:
>>>> - in some comment here:
>>>> https://issues.apache.org/jira/browse/JAMES-3086
>>> I read that people say they are "not using search". I'm very
>>> curious about that: what does it mean to have a mail server with
>>> IMAP and/or JMAP without using any search?
>>>
>>> As far as I know, the main IMAP RFC requires some search support.
>>>
>>> Are they removing the `SearchProcessor` from the IMAP server and
>>> returning errors to their clients? Do they expect that no user will
>>> ever hit the search button of their MUA?
>> I see many use cases where you would not need search, essentially
>> based
>> on automatic mail processing, which is a common James workflow.
> Does it still make sense to support IMAP at this point? I'm almost sure
> people would expect REST and/or MQ in this case, don't you think?


Standard vs non-standard API? So yes, it can make sense. I won't go
further on this topic because, as you said, I don't know exactly the
need for such a workflow, so if people are interested, please contribute
to this discussion.


>
>>> They complain about ElasticSearch indexing being slow (or one could
>>> also say expensive): wait until they do a full-scan search of users
>>> inbox (:
>>>
>>> I'm ok to have solutions for a different "upfront indexing
>>> cost"/"search performance" ratio but not to propose a distributed
>>> server relying on doing a full-scan of cassandra for every incoming
>>> search.
>>>
>>> We have to be open to custom usages of James and make it possible
>>> for
>>> developers to remove some features they don't need. But I'm not
>>> convinced a user should be able to do that with an configuration
>>> option.
>> That's not only because of indexing being slow, it's also to get rid
>> of
>> the whole ElasticSearch cluster.
> The link you provided, to which I refer when talking about indexing
> being slow, is about indexing being slow as far as I understand. Let's
> deal with these two cases in different threads to avoid confusion.


We did not read the same sentence:

"We don't use elasticsearch, why is it not possible to remove it?"

Again, I cannot go further on this point: I'm not the user complaining
about the presence of ElasticSearch.


>
>>>> - one of our customers plan to deploy a distributed James server
>>>> for
>>>> serving POP3 encrypted emails. This deployment does not rely on
>>>> searching features. However as part of current Distributed James
>>>> server
>>>> he is forced to rely on ElasticSearch email indexing.
>>>>
>>>> This results in wasted resources as maintaining an ElasticSearch
>>>> cluster
>>>> to keep up with the volume is expensive.
>>>> Maintaining an ElasticSearch cluster when not needed is costly at
>>>> several levels:
>>>> - cost of infrastructure to deploy it
>>>> - cost of people having to maintain it
>>>> - performance cost on James to unnecessarily index data
>>> Meanwhile they also pay the price for IMAP and JMAP data indexing
>>> in
>>> mailbox code (generating ids that are never consumed but using
>>> cassandra LWT, same for modseq, projections in various tables that
>>> are
>>> never read, etc).
>>>
>>> And while they can easily disable such protocols they will still
>>> have a
>>> IMAP/JMAP server (with associated cost) serving only POP3.
>>>
>>> Does disabling only ES in that context makes sense at all for the
>>> Distributed James *product*?
>>>
>>> Shouldn't we craft a specific Distributed SMTP+POP product instead
>>> that
>>> would remove all wastes?
>> It makes sense because it allows to easily go back from one
>> configuration to the other. Going back and forth between scanning
>> implementation and ES one is pretty easy.
> As long as you don't have real users with mails. How long will a full-
> reindex (that is supposed to be slow according to user complains) take
> with some Terabytes of emails? Is it what you call "easy"? Because
> having a Distributed Mail Server without a huge amount of data doesn't
> make much sense.


It depends: the Distributed Mail Server currently covers the use case of
high availability, so it can make sense outside of the big data world.


>
> So, let's be realistic: this switch, while possible with some
> configuration would be quite hard to handle properly in real world (it
> requires at least some ops and active monitoring).
>
>> Having a new (potentially optimized) product could be great in some
>> cases, but would totally go against this.
> Can we have arguments?
>
> Bundling too many use cases in a single product is not very appealing
> to me because I suspect it will become be too complex by doing too many
> different things, confusing to user because we'll have to explain
> carefully in which case a specific option make sense, hard to maintain,
> because it's hard to make good choices when we can't figure out what
> are our users, etc.


What's the difference between explaining a configuration option and
explaining which product to choose? From my point of view, a single
product is more comfortable: you know you have configuration options
that give you this or that behavior. Several products force you to make
the right choice at the very beginning of the project, when you don't
yet know your requirements well enough to make it.


>
>>>> ## How ?
>>>>
>>>> Scanning search is a search implementation that is running on top
>>>> of
>>>> any
>>>> mailbox implementation, even distributed ones and does not
>>>> require
>>>> to
>>>> index data.
>>>>
>>>> Scanning Search is tested both at the component level (unit test)
>>> With 38 disabled testcases
>>>
>>>> but
>>>> also passes IMAP (MPT) tests on top of Cassandra implementation,
>>>>    as well
>>>> as JMAP memory tests, thus delivers correct results. Of course it
>>>> does
>>>> not support full text search.
>>>>
>>>> We should allow Distributed James to optionally rely on scanning
>>>> search
>>>> instead of ElasticSearch.
>>>>
>>>>     - Scanning search should be advised for deployments rarely
>>>> searching data
>>>>     - ElasticSearch should be advised when search is frequent or
>>>> requires
>>>> high performance
>>>>
>>>> We could use module choosing [1] to choose between scanning
>>>> search
>>>> and
>>>> ElasticSearch.
>>>>
>>>> To be noted that scanning search introduces no other dependencies
>>>> as
>>>> it
>>>> is part of mailbox-store thus causes no risk of library clashes.
>>>>
>>>> To be noted also that metric collection and log collection using
>>>> ElasticSearch is unaffected.
>>>>
>>> You don't mentionned what will happen in the case of a search: we
>>> are
>>> probably going to read full mails for the searched mailbox or even
>>> for
>>> a given user in case of a multi-mailboxes search to find relevant
>>> emails.
>>>
>>> For a user with 10GiB of emails, it will for sure timeout and will
>>> probably bring the whole cluster on its knees.
>>>
>>> I don't find the scanning search relevant.
>> There is no case of search.
> I don't understand that sentence: what are you describing? I mean, one
> can send a SEARCH command to IMAP and it would be served by the
> scanning search, right?


A given workflow could use the IMAP protocol without SEARCH. That's what
I had in mind, but anyway those are suppositions about a workflow I
don't know, so I cannot say much about it.


>
>>   If there is one, the only thing to do is to
>> add an ES cluster and change the configuration option.
> Why setting up a scanning search if you don't want any search to
> happen?! I'd rather use a NoopSearch that never finds anything.


Why not. In that case we can add a configuration option to the
Distributed James Server with three choices:

- ES search

- scanning search (expect low search performance except in some
particular cases, so searching should be avoided)

- Noop search (expect no search results, or errors when searching)
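
As a rough illustration of the three choices above, here is a minimal
sketch in Java of how such an option could be parsed. The enum
`SearchChoice`, the values `elasticsearch`, `scanning` and `noop`, and
the default are all assumptions made for this example; they are not taken
from the James code base or from the pull request.

```java
import java.util.Locale;

/** Hypothetical search backends for the Distributed James Server (names are illustrative). */
enum SearchChoice {
    ELASTICSEARCH, // indexed search, requires an ES cluster
    SCANNING,      // no index, reads messages at query time
    NOOP;          // no index, never returns a result

    /** Parses a configuration value; a missing value falls back to ES (assumed default). */
    static SearchChoice parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return ELASTICSEARCH;
        }
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "elasticsearch": return ELASTICSEARCH;
            case "scanning":      return SCANNING;
            case "noop":          return NOOP;
            default:
                throw new IllegalArgumentException("Unknown search implementation: " + value);
        }
    }

    /** Only the ES choice needs an ElasticSearch cluster for mail indexing. */
    boolean requiresElasticSearchCluster() {
        return this == ELASTICSEARCH;
    }
}
```

Rejecting unknown values at startup, rather than silently falling back,
is one possible design choice here: a typo in the option would otherwise
leave an operator believing that indexing is disabled when it is not.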


>
>>>> ## Alternative
>>>>
>>>> The alternative would be to build a different
>>>> product/flavor/server
>>>> than
>>>> the distributed one, where the only difference with the
>>>> distributed
>>>> one
>>>> is that indexing will rely on scanning instead of ElasticSearch.
>>>>
>>>> The maintenance cost of such a product/flavor/server is higher
>>>> than
>>>> of a
>>>> configuration option (Docker images to release, time and energy
>>>> to
>>>> run
>>>> integration tests on it).
>>>>
>>>> Such a product/flavor is hard to brand because even if it answers
>>>> a
>>>> need, it is not so far of the distributed one, and does not
>>>> answer
>>>> needs
>>>> that are very far from it neither.
>>>>
>>>> The advantage is that is would allow to more fine tune this
>>>> solution
>>>> to
>>>> answer to the exact needs.
>>>>
>>> Another alternative would be:
>>>
>>> * implement a SMTP+POP3 product where we can progressively remove
>>> the
>>> unneeded parts as we did for the SMTP-only product
>>>
>>> * throttle the indexing to limit the impact of this process when
>>> receiving a lot of mails (at the cost of having a search index not
>>> so
>>> up-to-date)
>>>
>>> * be able to configure what is indexed (if we drop attachment
>>> indexing
>>> and full-text indexing we'll probably be way faster)
>>>
>>> It's just an example of what we could do and there are a lot of
>>> other
>>> solutions.
>>>
>>> I'm convinced both use cases are really differents and you put them
>>> together because the solution to one problem happens to somehow
>>> solve
>>> another issue at once.
>>>
>>> I propose to focus on the use case that is the most important right
>>> now
>>> and to search for solutions regardless of other issues we may have.
>>>
>>> Or at least discuss these issues in two threads.
>>>
>>> What do you think?
>>>
>> We have a simple solution allowing to configure a specific use case
>> in a
>> supported and wide used product. Is it better than a specific
>> solution
>> in a product which is harder to define for users and is probably not
>> enough of interest to be really well maintained?
>>
> I would say it's a matter of opinion: what you find simple, I find it
> confusing, etc.
>
> You also argued it's a common usage pattern to have James in this kind
> of configuration so I guess people would be eager to maintain it?
>
> Or are we proposing to maintain a feature that we don't expect to use
> by ourselves and think is not so relevant? I would then propose to not
> support that feature at all in this case.
>
> To conclude with a proposition, I would say we should focus on the use
> cases and then try to figure out what are the consequences.
>
> In this case I would describe the use case as "As an enterprise IT
> architect, I want to deploy James to handle mail interaction with
> people from the outside in an internal domain application".
>
> Not sure it covers the use case you have in mind so please comment with
> your ideas.


For my part I have only one use case in mind: my customer's. A
configuration option would solve it. A product would too (that's why I
put it under "Alternative"). The customer prefers the configuration
option, and so do I, for the reasons I gave.

The arguments drawn from other use cases are only loosely related, and
you are right: as I don't know them exactly, I cannot tell whether my
solution would fit their needs too.

Finally, the configuration option is already the subject of a pull
request, and it seems much simpler than a new product, both in quantity
of code and in impact on deployment. Of course simplicity is very
subjective: for example, Guice is simple for you, but it can be different
for other people. If a new product makes more sense in the future,
reverting this PR and building a product around it would not be much of
a burden. The other way around would be much harder: going from 2
(potentially incompatible) products to only one would require us to
implement a data migration.

That's also why I think the right choice for now is to add a
configuration option.

Cheers,

Raphaël.


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: Distributed James: make ElasticSearch indexing optional?

Posted by Matthieu Baechler <ma...@apache.org>.

Hi Raphael,

On Fri, 2020-06-12 at 18:29 +0200, Raphaël Ouazana-Sustowski wrote:
> Hello Matthieu,
> 
> Le 12/06/2020 à 10:05, Matthieu Baechler a écrit :
> > Hi Raphael,
> > 
> > My answers below
> > 
> > On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
> > > Hi,
> > > 
> > > Here is a proposal to make ElasticSearch optional in our
> > > distributed
> > > product/flavor/server.
> > > 
> > > Comments are welcome.
> > > 
> > > 
> > > ## Why?
> > > 
> > > Some people have expressed the need of using a distributed James
> > > without
> > > ElasticSearch:
> > > - in some comment here:
> > > https://issues.apache.org/jira/browse/JAMES-3086
> > I read that people asking they are "not using search". I'm very
> > curious
> > about that: what does it mean to have a mail server with either
> > IMAP
> > and/or JMAP without using any search ?
> > 
> > As far as I know, the main IMAP RFC requires some search support.
> > 
> > Are they removing the `SearchProcessor` from the IMAP server and
> > return
> > errors to their clients? Do they expect that no user will ever hit
> > the
> > search button of their MUA?
> 
> I see many use cases where you would not need search, essentially
> based 
> on automatic mail processing, which is a common James workflow.

Does it still make sense to support IMAP at this point? I'm almost sure
people would expect REST and/or MQ in this case, don't you think?

> 
> > They complain about ElasticSearch indexing being slow (or one could
> > also say expensive): wait until they do a full-scan search of users
> > inbox (:
> > 
> > I'm ok to have solutions for a different "upfront indexing
> > cost"/"search performance" ratio but not to propose a distributed
> > server relying on doing a full-scan of cassandra for every incoming
> > search.
> > 
> > We have to be open to custom usages of James and make it possible
> > for
> > developers to remove some features they don't need. But I'm not
> > convinced a user should be able to do that with an configuration
> > option.
> 
> That's not only because of indexing being slow, it's also to get rid
> of 
> the whole ElasticSearch cluster.

The link you provided, to which I refer when talking about indexing
being slow, is about indexing being slow as far as I understand. Let's
deal with these two cases in different threads to avoid confusion.

> 
> > > - one of our customers plan to deploy a distributed James server
> > > for
> > > serving POP3 encrypted emails. This deployment does not rely on
> > > searching features. However as part of current Distributed James
> > > server
> > > he is forced to rely on ElasticSearch email indexing.
> > > 
> > > This results in wasted resources as maintaining an ElasticSearch
> > > cluster
> > > to keep up with the volume is expensive.
> > > Maintaining an ElasticSearch cluster when not needed is costly at
> > > several levels:
> > > - cost of infrastructure to deploy it
> > > - cost of people having to maintain it
> > > - performance cost on James to unnecessarily index data
> > Meanwhile they also pay the price for IMAP and JMAP data indexing
> > in
> > mailbox code (generating ids that are never consumed but using
> > cassandra LWT, same for modseq, projections in various tables that
> > are
> > never read, etc).
> > 
> > And while they can easily disable such protocols they will still
> > have a
> > IMAP/JMAP server (with associated cost) serving only POP3.
> > 
> > Does disabling only ES in that context makes sense at all for the
> > Distributed James *product*?
> > 
> > Shouldn't we craft a specific Distributed SMTP+POP product instead
> > that
> > would remove all wastes?
> 
> It makes sense because it allows to easily go back from one 
> configuration to the other. Going back and forth between scanning 
> implementation and ES one is pretty easy.

As long as you don't have real users with mails. How long will a full
reindex (which is supposed to be slow according to user complaints) take
with several terabytes of emails? Is that what you call "easy"? Because
having a Distributed Mail Server without a huge amount of data doesn't
make much sense.

So, let's be realistic: this switch, while possible with some
configuration, would be quite hard to handle properly in the real world
(it requires at least some ops work and active monitoring).

> 
> Having a new (potentially optimized) product could be great in some 
> cases, but would totally go against this.

Can we have arguments? 

Bundling too many use cases in a single product is not very appealing
to me. I suspect it will become too complex by doing too many different
things, confusing to users because we'll have to explain carefully in
which case a specific option makes sense, hard to maintain because it's
hard to make good choices when we can't figure out who our users are,
etc.

> 
> > > ## How ?
> > > 
> > > Scanning search is a search implementation that is running on top
> > > of
> > > any
> > > mailbox implementation, even distributed ones and does not
> > > require
> > > to
> > > index data.
> > > 
> > > Scanning Search is tested both at the component level (unit test)
> > With 38 disabled testcases
> > 
> > > but
> > > also passes IMAP (MPT) tests on top of Cassandra implementation,
> > >   as well
> > > as JMAP memory tests, thus delivers correct results. Of course it
> > > does
> > > not support full text search.
> > > 
> > > We should allow Distributed James to optionally rely on scanning
> > > search
> > > instead of ElasticSearch.
> > > 
> > >    - Scanning search should be advised for deployments rarely
> > > searching data
> > >    - ElasticSearch should be advised when search is frequent or
> > > requires
> > > high performance
> > > 
> > > We could use module choosing [1] to choose between scanning
> > > search
> > > and
> > > ElasticSearch.
> > > 
> > > To be noted that scanning search introduces no other dependencies
> > > as
> > > it
> > > is part of mailbox-store thus causes no risk of library clashes.
> > > 
> > > To be noted also that metric collection and log collection using
> > > ElasticSearch is unaffected.
> > > 
> > You don't mentionned what will happen in the case of a search: we
> > are
> > probably going to read full mails for the searched mailbox or even
> > for
> > a given user in case of a multi-mailboxes search to find relevant
> > emails.
> > 
> > For a user with 10GiB of emails, it will for sure timeout and will
> > probably bring the whole cluster on its knees.
> > 
> > I don't find the scanning search relevant.
> 
> There is no case of search.

I don't understand that sentence: what are you describing? I mean, one
can send a SEARCH command to IMAP and it would be served by the
scanning search, right?

>  If there is one, the only thing to do is to 
> add an ES cluster and change the configuration option.

Why set up a scanning search if you don't want any search to
happen?! I'd rather use a NoopSearch that never finds anything.

> 
> > > ## Alternative
> > > 
> > > The alternative would be to build a different
> > > product/flavor/server
> > > than
> > > the distributed one, where the only difference with the
> > > distributed
> > > one
> > > is that indexing will rely on scanning instead of ElasticSearch.
> > > 
> > > The maintenance cost of such a product/flavor/server is higher
> > > than
> > > of a
> > > configuration option (Docker images to release, time and energy
> > > to
> > > run
> > > integration tests on it).
> > > 
> > > Such a product/flavor is hard to brand because even if it answers
> > > a
> > > need, it is not so far of the distributed one, and does not
> > > answer
> > > needs
> > > that are very far from it neither.
> > > 
> > > The advantage is that is would allow to more fine tune this
> > > solution
> > > to
> > > answer to the exact needs.
> > > 
> > 
> > Another alternative would be:
> > 
> > * implement a SMTP+POP3 product where we can progressively remove
> > the
> > unneeded parts as we did for the SMTP-only product
> > 
> > * throttle the indexing to limit the impact of this process when
> > receiving a lot of mails (at the cost of having a search index not
> > so
> > up-to-date)
> > 
> > * be able to configure what is indexed (if we drop attachment
> > indexing
> > and full-text indexing we'll probably be way faster)
> > 
> > It's just an example of what we could do and there are a lot of
> > other
> > solutions.
> > 
> > I'm convinced both use cases are really differents and you put them
> > together because the solution to one problem happens to somehow
> > solve
> > another issue at once.
> > 
> > I propose to focus on the use case that is the most important right
> > now
> > and to search for solutions regardless of other issues we may have.
> > 
> > Or at least discuss these issues in two threads.
> > 
> > What do you think?
> > 
> 
> We have a simple solution allowing to configure a specific use case
> in a 
> supported and wide used product. Is it better than a specific
> solution 
> in a product which is harder to define for users and is probably not 
> enough of interest to be really well maintained?
> 

I would say it's a matter of opinion: what you find simple, I find
confusing, etc.

You also argued it's a common usage pattern to have James in this kind
of configuration so I guess people would be eager to maintain it?

Or are we proposing to maintain a feature that we don't expect to use
by ourselves and think is not so relevant? I would then propose to not
support that feature at all in this case.

To conclude with a proposal, I would say we should focus on the use
cases and then try to figure out what the consequences are.

In this case I would describe the use case as "As an enterprise IT
architect, I want to deploy James to handle mail interaction with
people from the outside in an internal domain application".

Not sure it covers the use case you have in mind so please comment with
your ideas.

Cheers,

-- Matthieu Baechler





Re: Distributed James: make ElasticSearch indexing optional?

Posted by Raphaël Ouazana-Sustowski <ro...@apache.org>.

Hello Matthieu,

Le 12/06/2020 à 10:05, Matthieu Baechler a écrit :
> Hi Raphael,
>
> My answers below
>
> On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
>> Hi,
>>
>> Here is a proposal to make ElasticSearch optional in our distributed
>> product/flavor/server.
>>
>> Comments are welcome.
>>
>>
>> ## Why?
>>
>> Some people have expressed the need of using a distributed James
>> without
>> ElasticSearch:
>> - in some comment here:
>> https://issues.apache.org/jira/browse/JAMES-3086
> I read that people asking they are "not using search". I'm very curious
> about that: what does it mean to have a mail server with either IMAP
> and/or JMAP without using any search ?
>
> As far as I know, the main IMAP RFC requires some search support.
>
> Are they removing the `SearchProcessor` from the IMAP server and return
> errors to their clients? Do they expect that no user will ever hit the
> search button of their MUA?


I see many use cases where you would not need search, essentially based 
on automatic mail processing, which is a common James workflow.


>
> They complain about ElasticSearch indexing being slow (or one could
> also say expensive): wait until they do a full-scan search of users
> inbox (:
>
> I'm ok to have solutions for a different "upfront indexing
> cost"/"search performance" ratio but not to propose a distributed
> server relying on doing a full-scan of cassandra for every incoming
> search.
>
> We have to be open to custom usages of James and make it possible for
> developers to remove some features they don't need. But I'm not
> convinced a user should be able to do that with an configuration
> option.


It's not only because indexing is slow; it's also about getting rid of
the whole ElasticSearch cluster.


>
>> - one of our customers plan to deploy a distributed James server for
>> serving POP3 encrypted emails. This deployment does not rely on
>> searching features. However as part of current Distributed James
>> server
>> he is forced to rely on ElasticSearch email indexing.
>>
>> This results in wasted resources as maintaining an ElasticSearch
>> cluster
>> to keep up with the volume is expensive.
>> Maintaining an ElasticSearch cluster when not needed is costly at
>> several levels:
>> - cost of infrastructure to deploy it
>> - cost of people having to maintain it
>> - performance cost on James to unnecessarily index data
> Meanwhile they also pay the price for IMAP and JMAP data indexing in
> mailbox code (generating ids that are never consumed but using
> cassandra LWT, same for modseq, projections in various tables that are
> never read, etc).
>
> And while they can easily disable such protocols they will still have a
> IMAP/JMAP server (with associated cost) serving only POP3.
>
> Does disabling only ES in that context makes sense at all for the
> Distributed James *product*?
>
> Shouldn't we craft a specific Distributed SMTP+POP product instead that
> would remove all wastes?


It makes sense because it allows us to easily go back from one
configuration to the other. Going back and forth between the scanning
implementation and the ES one is pretty easy.


Having a new (potentially optimized) product could be great in some 
cases, but would totally go against this.


>
>> ## How ?
>>
>> Scanning search is a search implementation that is running on top of
>> any
>> mailbox implementation, even distributed ones and does not require
>> to
>> index data.
>>
>> Scanning Search is tested both at the component level (unit test)
> With 38 disabled testcases
>
>> but
>> also passes IMAP (MPT) tests on top of Cassandra implementation,
>>   as well
>> as JMAP memory tests, thus delivers correct results. Of course it
>> does
>> not support full text search.
>>
>> We should allow Distributed James to optionally rely on scanning
>> search
>> instead of ElasticSearch.
>>
>>    - Scanning search should be advised for deployments rarely
>> searching data
>>    - ElasticSearch should be advised when search is frequent or
>> requires
>> high performance
>>
>> We could use module choosing [1] to choose between scanning search
>> and
>> ElasticSearch.
>>
>> To be noted that scanning search introduces no other dependencies as
>> it
>> is part of mailbox-store thus causes no risk of library clashes.
>>
>> To be noted also that metric collection and log collection using
>> ElasticSearch is unaffected.
>>
> You don't mentionned what will happen in the case of a search: we are
> probably going to read full mails for the searched mailbox or even for
> a given user in case of a multi-mailboxes search to find relevant
> emails.
>
> For a user with 10GiB of emails, it will for sure timeout and will
> probably bring the whole cluster on its knees.
>
> I don't find the scanning search relevant.


There is no case of search. If there is one, the only thing to do is to 
add an ES cluster and change the configuration option.


>
>> ## Alternative
>>
>> The alternative would be to build a different product/flavor/server
>> than
>> the distributed one, where the only difference with the distributed
>> one
>> is that indexing will rely on scanning instead of ElasticSearch.
>>
>> The maintenance cost of such a product/flavor/server is higher than
>> of a
>> configuration option (Docker images to release, time and energy to
>> run
>> integration tests on it).
>>
>> Such a product/flavor is hard to brand because even if it answers a
>> need, it is not so far of the distributed one, and does not answer
>> needs
>> that are very far from it neither.
>>
>> The advantage is that is would allow to more fine tune this solution
>> to
>> answer to the exact needs.
>>
>
> Another alternative would be:
>
> * implement a SMTP+POP3 product where we can progressively remove the
> unneeded parts as we did for the SMTP-only product
>
> * throttle the indexing to limit the impact of this process when
> receiving a lot of mails (at the cost of having a search index not so
> up-to-date)
>
> * be able to configure what is indexed (if we drop attachment indexing
> and full-text indexing we'll probably be way faster)
>
> It's just an example of what we could do and there are a lot of other
> solutions.
>
> I'm convinced both use cases are really differents and you put them
> together because the solution to one problem happens to somehow solve
> another issue at once.
>
> I propose to focus on the use case that is the most important right now
> and to search for solutions regardless of other issues we may have.
>
> Or at least discuss these issues in two threads.
>
> What do you think?
>

We have a simple solution that allows configuring a specific use case in
a supported and widely used product. Is it better than a specific
solution in a product which is harder to define for users and is probably
not of enough interest to be really well maintained?

Cheers,

Raphaël.



Re: Distributed James: make ElasticSearch indexing optional?

Posted by Matthieu Baechler <ma...@apache.org>.

Hi Raphael,

My answers below

On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
> Hi,
> 
> Here is a proposal to make ElasticSearch optional in our distributed 
> product/flavor/server.
> 
> Comments are welcome.
> 
> 
> ## Why?
> 
> Some people have expressed the need of using a distributed James
> without 
> ElasticSearch:
> - in some comment here: 
> https://issues.apache.org/jira/browse/JAMES-3086

I read that people say they are "not using search". I'm very curious
about that: what does it mean to have a mail server with IMAP and/or
JMAP without using any search?

As far as I know, the main IMAP RFC requires some search support. 

Are they removing the `SearchProcessor` from the IMAP server and
returning errors to their clients? Do they expect that no user will ever
hit the search button of their MUA?

They complain about ElasticSearch indexing being slow (or one could
also say expensive): wait until they do a full-scan search of a user's
inbox (:

I'm ok to have solutions for a different "upfront indexing
cost"/"search performance" ratio but not to propose a distributed
server relying on doing a full-scan of Cassandra for every incoming
search.

We have to be open to custom usages of James and make it possible for
developers to remove some features they don't need. But I'm not
convinced a user should be able to do that with a configuration
option.

> - one of our customers plan to deploy a distributed James server for 
> serving POP3 encrypted emails. This deployment does not rely on 
> searching features. However as part of current Distributed James
> server 
> he is forced to rely on ElasticSearch email indexing.
> 
> This results in wasted resources as maintaining an ElasticSearch
> cluster 
> to keep up with the volume is expensive.
> Maintaining an ElasticSearch cluster when not needed is costly at 
> several levels:
> - cost of infrastructure to deploy it
> - cost of people having to maintain it
> - performance cost on James to unnecessarily index data

Meanwhile they also pay the price for IMAP and JMAP data indexing in
mailbox code (generating ids that are never consumed but using
Cassandra LWT, same for modseq, projections in various tables that are
never read, etc).

And while they can easily disable such protocols they will still have
an IMAP/JMAP server (with associated cost) serving only POP3.

Does disabling only ES in that context make sense at all for the
Distributed James *product*?

Shouldn't we craft a specific Distributed SMTP+POP product instead that
would remove all the waste?

> ## How ?
> 
> Scanning search is a search implementation that is running on top of
> any 
> mailbox implementation, even distributed ones and does not require
> to 
> index data.
> 
> Scanning Search is tested both at the component level (unit test) 

With 38 disabled test cases

> but 
> also passes IMAP (MPT) tests on top of Cassandra implementation,
>  as well 
> as JMAP memory tests, thus delivers correct results. Of course it
> does 
> not support full text search.
> 
> We should allow Distributed James to optionally rely on scanning
> search 
> instead of ElasticSearch.
> 
>   - Scanning search should be advised for deployments rarely
> searching data
>   - ElasticSearch should be advised when search is frequent or
> requires 
> high performance
> 
> We could use module choosing [1] to choose between scanning search
> and 
> ElasticSearch.
> 
> To be noted that scanning search introduces no other dependencies as
> it 
> is part of mailbox-store thus causes no risk of library clashes.
> 
> To be noted also that metric collection and log collection using 
> ElasticSearch is unaffected.
> 

You didn't mention what will happen in the case of a search: we are
probably going to read full mails for the searched mailbox, or even for
a given user in the case of a multi-mailbox search, to find relevant
emails.

For a user with 10GiB of emails, it will surely time out and will
probably bring the whole cluster to its knees.

I don't find the scanning search relevant.
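
To make the cost concern concrete, here is a minimal sketch in Java of
what a scanning search amounts to: every message of the searched mailbox
is read and tested at query time, so the work grows with the amount of
stored mail. The `StoredMessage` and `ScanningSearchSketch` classes are
hypothetical and deliberately much simpler than the real mailbox-store
code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified message: a UID and its full content. */
final class StoredMessage {
    final long uid;
    final String content;

    StoredMessage(long uid, String content) {
        this.uid = uid;
        this.content = content;
    }
}

final class ScanningSearchSketch {
    /**
     * Returns the UIDs of the messages containing the given text.
     * There is no index: each message is loaded and inspected, so the cost
     * is proportional to the total size of the mailbox.
     */
    static List<Long> search(List<StoredMessage> mailbox, String text) {
        List<Long> matches = new ArrayList<>();
        for (StoredMessage message : mailbox) {
            if (message.content.contains(text)) {
                matches.add(message.uid);
            }
        }
        return matches;
    }
}
```

With an index, only the matching entries would be touched at query time;
here a 10GiB mailbox means reading on the order of 10GiB per search.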

> ## Alternative
> 
> The alternative would be to build a different product/flavor/server
> than 
> the distributed one, where the only difference with the distributed
> one 
> is that indexing will rely on scanning instead of ElasticSearch.
> 
> The maintenance cost of such a product/flavor/server is higher than
> of a 
> configuration option (Docker images to release, time and energy to
> run 
> integration tests on it).
> 
> Such a product/flavor is hard to brand because even if it answers a 
> need, it is not so far of the distributed one, and does not answer
> needs 
> that are very far from it neither.
> 
> The advantage is that is would allow to more fine tune this solution
> to 
> answer to the exact needs.
> 

Another alternative would be:

* implement an SMTP+POP3 product where we can progressively remove the
unneeded parts as we did for the SMTP-only product

* throttle the indexing to limit the impact of this process when
receiving a lot of mails (at the cost of having a search index not so
up-to-date)

* be able to configure what is indexed (if we drop attachment indexing
and full-text indexing we'll probably be way faster)

It's just an example of what we could do and there are a lot of other
solutions.

I'm convinced both use cases are really different and you put them
together because the solution to one problem happens to somehow solve
another issue at once.

I propose to focus on the use case that is the most important right now
and to search for solutions regardless of other issues we may have.

Or at least discuss these issues in two threads.

What do you think?

-- Matthieu Baechler
