
Posted to issues@ponymail.apache.org by johnament <gi...@git.apache.org> on 2017/01/02 17:47:37 UTC

[GitHub] incubator-ponymail issue #314: Support for time series indices

GitHub user johnament opened an issue:

    https://github.com/apache/incubator-ponymail/issues/314

    Support for time series indices

    Ponymail's data is essentially time series data: it captures email data, including the time each message was sent.  The time sent may make a good key for a time series index.
    
    https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html
    
    One of the downsides of Elasticsearch is that an index's shard count can't be changed after the index is created.  There is an upper limit on the total data that can be stored in an index, based on the shard count and document size.  To allow Ponymail to operate beyond those limits, it should store its data in indices keyed off some time factor.  That factor should probably be configurable (daily, monthly, yearly, or off).  Each index should be created automatically, with settings standardized based on data volume.
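
A minimal sketch of how such a configurable factor might map a message's sent time to an index name. This is not Ponymail code: the function name `index_name_for`, the granularity strings, and the `base-YYYY.MM` naming pattern are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical granularity names, mirroring the options listed in the issue.
_SUFFIX_FORMATS = {
    "daily": "%Y.%m.%d",
    "monthly": "%Y.%m",
    "yearly": "%Y",
}


def index_name_for(base, sent, granularity="monthly"):
    """Return the index a message sent at `sent` would be stored in.

    With granularity "off", everything goes into the single `base` index.
    """
    if granularity == "off":
        return base
    if granularity not in _SUFFIX_FORMATS:
        raise ValueError("unknown granularity: %r" % (granularity,))
    # Normalise to UTC so the same message always lands in the same index,
    # regardless of the sender's time zone. Naive datetimes are assumed UTC.
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)
    else:
        sent = sent.astimezone(timezone.utc)
    return "%s-%s" % (base, sent.strftime(_SUFFIX_FORMATS[granularity]))
```

For example, `index_name_for("ponymail", sent, "yearly")` would place a message sent in January 2017 into `ponymail-2017`. Creating each index with suitable settings the first time a name appears is left out of the sketch.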



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-ponymail issue #314: Support for time series indices

Posted by sebbASF <gi...@git.apache.org>.

GitHub user sebbASF commented on the issue:

    https://github.com/apache/incubator-ponymail/issues/314
  
    Unlike logging events, emails don't always have an accurate timestamp.
    
    The Date: header can contain completely erroneous information (see #87 and #122).
    In the case of mailing-list messages, there will be other timestamps that are more likely to be accurate, but are they consistent across different sources?
    Great care will be needed to ensure that a consistent scheme is used, so that mails don't 'fall through the cracks'.
    
    If the max number of docs in an index does depend on the doc size (where is this documented?), then the mbox_source index will reach the limit well before the mbox index. Because the mbox_source index is not used for searching, it should be much simpler to implement a partitioning scheme for it.
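
One possible shape for a consistent scheme, given the concern above about unreliable Date: headers, is to fall back to a second timestamp whenever the header is missing, unparseable, or implausible. The sketch below is an illustration under assumptions, not how Ponymail handles dates: `pick_timestamp`, `received_at` (a timezone-aware time recorded when the message was archived), and both plausibility bounds are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed plausibility bounds; real deployments would tune these.
EARLIEST = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_FUTURE_SKEW = timedelta(days=1)


def pick_timestamp(date_header, received_at):
    """Choose the timestamp used to assign a message to an index.

    Prefers the Date: header, but falls back to `received_at` (which must
    be timezone-aware) when the header cannot be trusted.
    """
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError, OverflowError):
        # Missing or unparseable header.
        return received_at
    if sent.tzinfo is None:
        # Headers with a "-0000" zone parse as naive; treat them as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    if sent < EARLIEST or sent > received_at + MAX_FUTURE_SKEW:
        # Implausibly old, or dated after the message was received.
        return received_at
    return sent.astimezone(timezone.utc)
```

Because the fallback always yields some timestamp, every message is assigned to exactly one index. The timestamp chosen would need to be stored with the document so that later lookups resolve to the same index.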

