Posted to issues@metron.apache.org by "Justin Leet (JIRA)" <ji...@apache.org> on 2018/11/29 19:52:00 UTC

[jira] [Created] (METRON-1912) Allow for indexing batches to be handled based on size

Justin Leet created METRON-1912:
-----------------------------------

             Summary: Allow for indexing batches to be handled based on size
                 Key: METRON-1912
                 URL: https://issues.apache.org/jira/browse/METRON-1912
             Project: Metron
          Issue Type: Improvement
            Reporter: Justin Leet


In the indexing topology, batching of output is handled on a per-sensor basis. E.g., bro and snort are each batched independently and shipped to ES when either the batch reaches the per-sensor configured count or the per-sensor configured timeout elapses.
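To make the current behavior concrete, here's a minimal sketch (not Metron's actual implementation; names and defaults are illustrative) of a per-sensor batcher that flushes on either a message-count threshold or a timeout, checked on each add:

```python
import time

class SensorBatcher:
    """Hypothetical per-sensor batcher: flushes when the configured
    message count is reached or the timeout has elapsed."""

    def __init__(self, batch_size, timeout_secs, flush_fn):
        self.batch_size = batch_size      # per-sensor count threshold
        self.timeout_secs = timeout_secs  # per-sensor timeout
        self.flush_fn = flush_fn          # e.g. a bulk write to Elasticsearch
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, message):
        self.buffer.append(message)
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.timeout_secs):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
        self.buffer = []
        self.last_flush = time.monotonic()

# One independently tuned batcher per sensor, as in the current design:
batchers = {
    "bro":   SensorBatcher(batch_size=200, timeout_secs=5.0, flush_fn=print),
    "snort": SensorBatcher(batch_size=50, timeout_secs=5.0, flush_fn=print),
}
```

Note the tuning burden this implies: every sensor carries its own batch_size/timeout pair.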

These batches are bounded by message count, not by data size. This means each individual sensor must be tuned independently, despite batching tying into overall performance. Per the Elasticsearch documentation, tuning bulk requests is heavily dependent on the data size of the batch rather than the number of items in it. Bulks that are too small result in too many requests and potential performance bottlenecks, while bulks that are too large ("beyond a couple tens of megabytes") can also cause ES degradation.
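The change amounts to swapping the flush condition from a count check to a byte-size check. A sketch of what a size-based condition might look like (the threshold and serialization are assumptions for illustration, not a proposed default):

```python
import json

# Illustrative byte threshold, well under the "couple tens of megabytes"
# the Elasticsearch docs warn about for bulk requests.
MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB

def should_flush(batch, max_bytes=MAX_BATCH_BYTES):
    """Flush once the batch, as it would be serialized for the bulk
    request, reaches the configured byte threshold."""
    batch_bytes = sum(len(json.dumps(msg).encode("utf-8")) for msg in batch)
    return batch_bytes >= max_bytes
```

In practice the running byte total would be maintained incrementally on each add rather than re-serialized per check.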

Moving to data size batching can be broken up into two variants, managing this per sensor or moving everything to a single batch that sends for all sensors as needed.
 * If we manage per sensor, this might allow us to provide more reasonable per-sensor defaults that avoid simple copy-pasting of configs that causes misbehavior. Batch sizes are more likely to be roughly correct overall, although introducing new sensors may still cause problems as more batches are sent. However, misbehavior is still very possible in the same manner as it exists today. Additional tuning as new sensors are onboarded remains a potential concern here (as it is in the current setup).
 * If we manage a pool for all sensors, this could substantially smooth out problems. Because all batches would be largely the same data size and configured at a single point, the opportunities for a sensor to misbehave are minimized (a single sensor could still send outlier messages, e.g. a 100 MB message, but such messages are already problematic today because they cause enormously sized batches). Configuration likely moves to the global config, and the existing batching is refactored to avoid breaking backward compatibility. This approach could also be mirrored by other batching (e.g. to Kafka) to ensure a consistent experience. Tuning indexing should also be easier, since it depends more on how much we're pushing to Elasticsearch and the particular cluster than on each individual sensor.
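A rough sketch of the second option (all names hypothetical; this is a shape for discussion, not a proposed implementation): a single pool accepts messages from every sensor and flushes on one globally configured byte threshold.

```python
import json

class SharedBatchPool:
    """Hypothetical single batch pool shared by all sensors: one
    size threshold, configured in one place (e.g. the global config),
    regardless of which sensor a message came from."""

    def __init__(self, max_bytes, flush_fn):
        self.max_bytes = max_bytes  # single, globally configured threshold
        self.flush_fn = flush_fn    # e.g. a bulk write to Elasticsearch
        self.buffer = []
        self.buffer_bytes = 0

    def add(self, sensor, message):
        encoded = json.dumps({"sensor": sensor, "message": message}).encode("utf-8")
        self.buffer.append((sensor, message))
        self.buffer_bytes += len(encoded)
        if self.buffer_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
        self.buffer = []
        self.buffer_bytes = 0
```

Onboarding a new sensor then requires no batch tuning at all; its messages simply flow into the shared pool. A timeout would still be needed alongside the byte threshold so low-volume periods don't strand messages.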

I'm in favor of the second option, but any implementation likely requires a discussion on the dev list.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)