Posted to user@metron.apache.org by Guillem Mateos <bb...@gmail.com> on 2017/07/31 17:02:35 UTC

Issues with indexing topology

Hi,

I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
discussed in May.

I have a Metron deployment using 0.4.0 on 10 nodes. The indexing topology
stops indexing messages when hitting the 10,000 (10k) message mark. This is
related, as previously found by Christian, to the Kafka offset strategy, and
after further debugging I could track it down to the number of uncommitted
offsets (maxUncommittedOffsets). This is specified in the Kafka spout, and I
could confirm that by providing a higher or lower value (5k or 15k) the
point at which the indexing stops is exactly that of maxUncommittedOffsets.

I understand the workaround suggested (changing the strategy from
UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as I
would guess the topology shouldn't need a change to that parameter to
properly ingest data without failing. What seems to happen is that with
LATEST the messages do successfully get committed to Kafka, while with
UNCOMMITTED_EARLIEST, at some point that stops happening.
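
For reference, both of those knobs live in the storm-kafka-client spout
configuration. Below is a minimal sketch of how they are wired together,
assuming the 1.1-style KafkaSpoutConfig builder API; the broker address,
topic name and cap value are placeholders, not values from an actual
Metron deployment:

import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import static org.apache.storm.kafka.spout.KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST;

public class IndexingSpoutSketch {
    public static KafkaSpout<String, String> build() {
        KafkaSpoutConfig<String, String> conf =
            KafkaSpoutConfig.builder("broker1:6667", "indexing")
                // Cap on polled-but-not-yet-committed offsets; once this
                // many tuples are pending, the spout stops polling until
                // some offsets are committed, matching the stall I see.
                .setMaxUncommittedOffsets(10000)
                // The strategy in question; LATEST is the workaround.
                .setFirstPollOffsetStrategy(UNCOMMITTED_EARLIEST)
                .build();
        return new KafkaSpout<>(conf);
    }
}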

When I run the topology with 'LATEST' I usually see messages like this one
on the Kafka Spout (indexing topology):

o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
[{indexing-0=OffsetAndMetadata{offset=2307113,
metadata='{topic-partition=indexing-0

I do not see such messages on the Kafka Spout when I have the issue and I'm
running UNCOMMITTED_EARLIEST.

Any suggestion on what may be the real source of the issue here? I did some
tests before and it did not seem to be an issue on 0.3.0. Could this be
something related to the new Kafka spout code in Metron? Or maybe related to
one of the PRs in Metron or Kafka? I saw one in Metron about duplicate
enrichment messages (METRON-569) and a few in Kafka regarding issues with
the committed offset, but most were for newer versions of Kafka than Metron
is using.

Thanks

Re: Issues with indexing topology

Posted by Guillem Mateos <bb...@gmail.com>.
Hi Laurens,

We're doing additional testing, but it seems to be fixed, yes. Are you
experiencing the same problem?



2017-08-14 6:12 GMT+02:00 Laurens Vets <la...@daemon.be>:

> Hi Guillem,
>
> Did you eventually fix the problem?
>
> On 2017-08-01 11:00, Guillem Mateos wrote:
>
> On the elasticsearch.properties file, right now, I have the following
> regarding workers and executors:
>
> ##### Storm #####
> indexing.workers=1
> indexing.executors=1
> topology.worker.childopts=
> topology.auto-credentials=[]
>
> Regarding the flux file for indexing, it is set exactly to what's on
> GitHub for Metron 0.4.0.
>
> Should I also double-check the enrichment topology? Could this be
> caused somehow by it?
>
> Thanks
>
> 2017-08-01 19:33 GMT+02:00 Ryan Merriman <me...@gmail.com>:
>
>> Yes, you are correct, they are separate concepts.  Once a tuple's tree has
>> been acked in Storm, meaning all the spouts/bolts that are required to ack
>> a tuple have done so, it is then committed to Kafka in the form of an
>> offset.  If a tuple is not completely acked, it will never be committed to
>> Kafka and the tuple will be replayed after a timeout.  Eventually you'll
>> have too many tuples in flight and the spout will stop emitting.
>>
>> I think the next step would be to review your configuration.  The way the
>> executor properties are named in Storm can be confusing so it's probably
>> best if you share your flux/property files.
>>
>> On Tue, Aug 1, 2017 at 12:01 PM, Guillem Mateos <bb...@gmail.com>
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Thanks for your quick reply. I've been trying to change a few settings
>>> today, from having the executors set to 1 to setting them to a different
>>> number. Also worth mentioning: the system I'm testing this with does not
>>> have a very high message input rate right now, so I wouldn't expect to
>>> need any special tuning. I'm roughly at about 100 messages per minute,
>>> which is really not much.
>>>
>>> After trying the executors at a different value I can confirm the issue
>>> still exists. I also see quite a number of messages like this one:
>>>
>>> Discarding stale fetch response for partition indexing-0 since its
>>> offset 2565827 does not match the expected offset 2565828
>>>
>>> Regarding ackers, I was under the impression that acking is something
>>> slightly different from committing: you ack a message and you also commit
>>> it, but it's not exactly the same. Am I right?
>>>
>>> Thanks
>>>
>>> 2017-07-31 19:40 GMT+02:00 Ryan Merriman <me...@gmail.com>:
>>>
>>>> Guillem,
>>>>
>>>> I think this ended up being caused by not having enough acker threads
>>>> to keep up.  This is controlled by the "topology.acker.executors" Storm
>>>> property that you will find in the indexing topology flux remote.yaml
>>>> file.  It is exposed in Ambari in the "elasticsearch-properties"
>>>> property, which is itself a list of properties.  Within that there is an
>>>> "indexing.executors" property.  If that is set to 0 it would definitely
>>>> be a problem, and I think that may even be the default in 0.4.0.  Try
>>>> changing that to match the number of partitions dedicated to the
>>>> indexing topic.
>>>>
>>>> You could also change the property directly in the flux file
>>>> ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from
>>>> the command line to verify this fixes it.  If you do use this strategy
>>>> to test, make sure you eventually make the change in Ambari so your
>>>> changes don't get overridden on a restart.  Changing this setting is
>>>> confusing, and there have been some recent commits that have addressed
>>>> that, exposing "topology.acker.executors" directly in Ambari in a
>>>> dedicated indexing topology section.
>>>>
>>>> You might want to also check out the performance tuning guide we did
>>>> recently:
>>>> https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md
>>>> If my guess is wrong and it's not the acker thread setting, the answer
>>>> is likely in there.
>>>>
>>>> Hope this helps.  If you're still stuck send us some more info and
>>>> we'll try to help you figure it out.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <bb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
>>>>> discussed in May.
>>>>>
>>>>> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing
>>>>> topology stops indexing messages when hitting the 10,000 (10k) message
>>>>> mark. This is related, as previously found by Christian, to the Kafka
>>>>> offset strategy, and after further debugging I could track it down to
>>>>> the number of uncommitted offsets (maxUncommittedOffsets). This is
>>>>> specified in the Kafka spout, and I could confirm that by providing a
>>>>> higher or lower value (5k or 15k) the point at which the indexing stops
>>>>> is exactly that of maxUncommittedOffsets.
>>>>>
>>>>> I understand the workaround suggested (changing the strategy from
>>>>> UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix,
>>>>> as I would guess the topology shouldn't need a change to that parameter
>>>>> to properly ingest data without failing. What seems to happen is that
>>>>> with LATEST the messages do successfully get committed to Kafka, while
>>>>> with UNCOMMITTED_EARLIEST, at some point that stops happening.
>>>>>
>>>>> When I run the topology with 'LATEST' I usually see messages like this
>>>>> one on the Kafka Spout (indexing topology):
>>>>>
>>>>> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
>>>>> [{indexing-0=OffsetAndMetadata{offset=2307113,
>>>>> metadata='{topic-partition=indexing-0
>>>>>
>>>>> I do not see such messages on the Kafka Spout when I have the issue
>>>>> and I'm running UNCOMMITTED_EARLIEST.
>>>>>
>>>>> Any suggestion on what may be the real source of the issue here? I did
>>>>> some tests before and it did not seem to be an issue on 0.3.0. Could
>>>>> this be something related to the new Kafka spout code in Metron? Or
>>>>> maybe related to one of the PRs in Metron or Kafka? I saw one in Metron
>>>>> about duplicate enrichment messages (METRON-569) and a few in Kafka
>>>>> regarding issues with the committed offset, but most were for newer
>>>>> versions of Kafka than Metron is using.
>>>>>
>>>>> Thanks
>>>>>
>>>>
>

Re: Issues with indexing topology

Posted by Laurens Vets <la...@daemon.be>.
Hi Guillem, 

Did you eventually fix the problem? 

On 2017-08-01 11:00, Guillem Mateos wrote:

> On the elasticsearch.properties file, right now, I have the following regarding workers and executors:
> 
> ##### Storm #####
> indexing.workers=1
> indexing.executors=1
> topology.worker.childopts=
> topology.auto-credentials=[]
> 
> Regarding the flux file for indexing, it is set exactly to what's on GitHub for Metron 0.4.0.
> 
> Should I also double-check the enrichment topology? Could this be caused somehow by it?
> 
> Thanks 
> 
> 2017-08-01 19:33 GMT+02:00 Ryan Merriman <me...@gmail.com>:
> 
> Yes, you are correct, they are separate concepts.  Once a tuple's tree has been acked in Storm, meaning all the spouts/bolts that are required to ack a tuple have done so, it is then committed to Kafka in the form of an offset.  If a tuple is not completely acked, it will never be committed to Kafka and the tuple will be replayed after a timeout.  Eventually you'll have too many tuples in flight and the spout will stop emitting.
> 
> I think the next step would be to review your configuration.  The way the executor properties are named in Storm can be confusing so it's probably best if you share your flux/property files. 
> 
> On Tue, Aug 1, 2017 at 12:01 PM, Guillem Mateos <bb...@gmail.com> wrote:
> 
> Hi Ryan,
> 
> Thanks for your quick reply. I've been trying to change a few settings today, from having the executors set to 1 to setting them to a different number. Also worth mentioning: the system I'm testing this with does not have a very high message input rate right now, so I wouldn't expect to need any special tuning. I'm roughly at about 100 messages per minute, which is really not much.
> 
> After trying the executors at a different value I can confirm the issue still exists. I also see quite a number of messages like this one:
> 
> Discarding stale fetch response for partition indexing-0 since its offset 2565827 does not match the expected offset 2565828
> 
> Regarding ackers, I was under the impression that acking is something slightly different from committing: you ack a message and you also commit it, but it's not exactly the same. Am I right?
> 
> Thanks 
> 
> 2017-07-31 19:40 GMT+02:00 Ryan Merriman <me...@gmail.com>:
> 
> Guillem, 
> 
> I think this ended up being caused by not having enough acker threads to keep up.  This is controlled by the "topology.acker.executors" Storm property that you will find in the indexing topology flux remote.yaml file.  It is exposed in Ambari in the "elasticsearch-properties" property, which is itself a list of properties.  Within that there is an "indexing.executors" property.  If that is set to 0 it would definitely be a problem, and I think that may even be the default in 0.4.0.  Try changing that to match the number of partitions dedicated to the indexing topic.
> 
> You could also change the property directly in the flux file ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from the command line to verify this fixes it.  If you do use this strategy to test, make sure you eventually make the change in Ambari so your changes don't get overridden on a restart.  Changing this setting is confusing, and there have been some recent commits that have addressed that, exposing "topology.acker.executors" directly in Ambari in a dedicated indexing topology section.
> 
> You might want to also check out the performance tuning guide we did recently:  https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md [1].  If my guess is wrong and it's not the acker thread setting, the answer is likely in there.   
> 
> Hope this helps.  If you're still stuck send us some more info and we'll try to help you figure it out. 
> 
> Ryan 
> 
> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <bb...@gmail.com> wrote:
> 
> Hi,
> 
> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman discussed in May.
> 
> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing topology stops indexing messages when hitting the 10,000 (10k) message mark. This is related, as previously found by Christian, to the Kafka offset strategy, and after further debugging I could track it down to the number of uncommitted offsets (maxUncommittedOffsets). This is specified in the Kafka spout, and I could confirm that by providing a higher or lower value (5k or 15k) the point at which the indexing stops is exactly that of maxUncommittedOffsets.
> 
> I understand the workaround suggested (changing the strategy from UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as I would guess the topology shouldn't need a change to that parameter to properly ingest data without failing. What seems to happen is that with LATEST the messages do successfully get committed to Kafka, while with UNCOMMITTED_EARLIEST, at some point that stops happening.
> 
> When I run the topology with 'LATEST' I usually see messages like this one on the Kafka Spout (indexing topology):
> 
> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka [{indexing-0=OffsetAndMetadata{offset=2307113, metadata='{topic-partition=indexing-0
> 
> I do not see such messages on the Kafka Spout when I have the issue and I'm running UNCOMMITTED_EARLIEST.
> 
> Any suggestion on what may be the real source of the issue here? I did some tests before and it did not seem to be an issue on 0.3.0. Could this be something related to the new Kafka spout code in Metron? Or maybe related to one of the PRs in Metron or Kafka? I saw one in Metron about duplicate enrichment messages (METRON-569) and a few in Kafka regarding issues with the committed offset, but most were for newer versions of Kafka than Metron is using.
> 
> Thanks

 

Links:
------
[1]
https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md

Re: Issues with indexing topology

Posted by Guillem Mateos <bb...@gmail.com>.
On the elasticsearch.properties file, right now, I have the following
regarding workers and executors:

##### Storm #####
indexing.workers=1
indexing.executors=1
topology.worker.childopts=
topology.auto-credentials=[]

Regarding the flux file for indexing, it is set exactly to what's on GitHub
for Metron 0.4.0.

Should I also double-check the enrichment topology? Could this be caused
somehow by it?

Thanks

2017-08-01 19:33 GMT+02:00 Ryan Merriman <me...@gmail.com>:

> Yes, you are correct, they are separate concepts.  Once a tuple's tree has
> been acked in Storm, meaning all the spouts/bolts that are required to ack
> a tuple have done so, it is then committed to Kafka in the form of an
> offset.  If a tuple is not completely acked, it will never be committed to
> Kafka and the tuple will be replayed after a timeout.  Eventually you'll
> have too many tuples in flight and the spout will stop emitting.
>
> I think the next step would be to review your configuration.  The way the
> executor properties are named in Storm can be confusing so it's probably
> best if you share your flux/property files.
>
> On Tue, Aug 1, 2017 at 12:01 PM, Guillem Mateos <bb...@gmail.com>
> wrote:
>
>> Hi Ryan,
>>
>> Thanks for your quick reply. I've been trying to change a few settings
>> today, from having the executors set to 1 to setting them to a different
>> number. Also worth mentioning: the system I'm testing this with does not
>> have a very high message input rate right now, so I wouldn't expect to
>> need any special tuning. I'm roughly at about 100 messages per minute,
>> which is really not much.
>>
>> After trying the executors at a different value I can confirm the issue
>> still exists. I also see quite a number of messages like this one:
>>
>> Discarding stale fetch response for partition indexing-0 since its offset
>> 2565827 does not match the expected offset 2565828
>>
>> Regarding ackers, I was under the impression that acking is something
>> slightly different from committing: you ack a message and you also commit
>> it, but it's not exactly the same. Am I right?
>>
>> Thanks
>>
>> 2017-07-31 19:40 GMT+02:00 Ryan Merriman <me...@gmail.com>:
>>
>>> Guillem,
>>>
>>> I think this ended up being caused by not having enough acker threads to
>>> keep up.  This is controlled by the "topology.acker.executors" Storm
>>> property that you will find in the indexing topology flux remote.yaml
>>> file.  It is exposed in Ambari in the "elasticsearch-properties" property,
>>> which is itself a list of properties.  Within that there is an
>>> "indexing.executors" property.  If that is set to 0 it would definitely be
>>> a problem, and I think that may even be the default in 0.4.0.  Try changing
>>> that to match the number of partitions dedicated to the indexing topic.
>>>
>>> You could also change the property directly in the flux file
>>> ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from
>>> the command line to verify this fixes it.  If you do use this strategy to
>>> test, make sure you eventually make the change in Ambari so your changes
>>> don't get overridden on a restart.  Changing this setting is confusing,
>>> and there have been some recent commits that have addressed that, exposing
>>> "topology.acker.executors" directly in Ambari in a dedicated indexing
>>> topology section.
>>>
>>> You might want to also check out the performance tuning guide we did
>>> recently:
>>> https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md
>>> If my guess is wrong and it's not the acker thread setting, the answer is
>>> likely in there.
>>>
>>> Hope this helps.  If you're still stuck send us some more info and we'll
>>> try to help you figure it out.
>>>
>>> Ryan
>>>
>>> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <bb...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
>>>> discussed in May.
>>>>
>>>> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing
>>>> topology stops indexing messages when hitting the 10,000 (10k) message
>>>> mark. This is related, as previously found by Christian, to the Kafka
>>>> offset strategy, and after further debugging I could track it down to the
>>>> number of uncommitted offsets (maxUncommittedOffsets). This is specified
>>>> in the Kafka spout, and I could confirm that by providing a higher or
>>>> lower value (5k or 15k) the point at which the indexing stops is exactly
>>>> that of maxUncommittedOffsets.
>>>>
>>>> I understand the workaround suggested (changing the strategy from
>>>> UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as
>>>> I would guess the topology shouldn't need a change to that parameter to
>>>> properly ingest data without failing. What seems to happen is that with
>>>> LATEST the messages do successfully get committed to Kafka, while with
>>>> UNCOMMITTED_EARLIEST, at some point that stops happening.
>>>>
>>>> When I run the topology with 'LATEST' I usually see messages like this
>>>> one on the Kafka Spout (indexing topology):
>>>>
>>>> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
>>>> [{indexing-0=OffsetAndMetadata{offset=2307113,
>>>> metadata='{topic-partition=indexing-0
>>>>
>>>> I do not see such messages on the Kafka Spout when I have the issue and
>>>> I'm running UNCOMMITTED_EARLIEST.
>>>>
>>>> Any suggestion on what may be the real source of the issue here? I did
>>>> some tests before and it did not seem to be an issue on 0.3.0. Could this
>>>> be something related to the new Kafka spout code in Metron? Or maybe
>>>> related to one of the PRs in Metron or Kafka? I saw one in Metron about
>>>> duplicate enrichment messages (METRON-569) and a few in Kafka regarding
>>>> issues with the committed offset, but most were for newer versions of
>>>> Kafka than Metron is using.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>

Re: Issues with indexing topology

Posted by Ryan Merriman <me...@gmail.com>.
Yes, you are correct, they are separate concepts.  Once a tuple's tree has
been acked in Storm, meaning all the spouts/bolts that are required to ack
a tuple have done so, it is then committed to Kafka in the form of an
offset.  If a tuple is not completely acked, it will never be committed to
Kafka and the tuple will be replayed after a timeout.  Eventually you'll
have too many tuples in flight and the spout will stop emitting.
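
To make the distinction concrete, here is a minimal, illustrative Storm
1.x bolt (the class and field names are mine, not from the Metron
codebase) that anchors its output to the input tuple and then acks it.
Only once every bolt in the tuple's tree has acked like this does the
spout treat the offset as eligible to commit back to Kafka:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AckingBoltSketch extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchor the emitted tuple to its input so it joins the tree.
        collector.emit(input, new Values(input.getString(0)));
        // Ack the input; only when the whole tree (this ack plus any
        // downstream acks) completes does the spout commit the offset.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}

If a bolt fails a tuple instead, or the tuple times out, the offset is
never committed and the tuple is replayed, exactly as described above.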

I think the next step would be to review your configuration.  The way the
executor properties are named in Storm can be confusing so it's probably
best if you share your flux/property files.

On Tue, Aug 1, 2017 at 12:01 PM, Guillem Mateos <bb...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for your quick reply. I've been trying to change a few settings
> today, from having the executors set to 1 to setting them to a different
> number. Also worth mentioning: the system I'm testing this with does not
> have a very high message input rate right now, so I wouldn't expect to
> need any special tuning. I'm roughly at about 100 messages per minute,
> which is really not much.
>
> After trying the executors at a different value I can confirm the issue
> still exists. I also see quite a number of messages like this one:
>
> Discarding stale fetch response for partition indexing-0 since its offset
> 2565827 does not match the expected offset 2565828
>
> Regarding ackers, I was under the impression that acking is something
> slightly different from committing: you ack a message and you also commit
> it, but it's not exactly the same. Am I right?
>
> Thanks
>
> 2017-07-31 19:40 GMT+02:00 Ryan Merriman <me...@gmail.com>:
>
>> Guillem,
>>
>> I think this ended up being caused by not having enough acker threads to
>> keep up.  This is controlled by the "topology.acker.executors" Storm
>> property that you will find in the indexing topology flux remote.yaml
>> file.  It is exposed in Ambari in the "elasticsearch-properties" property,
>> which is itself a list of properties.  Within that there is an
>> "indexing.executors" property.  If that is set to 0 it would definitely be
>> a problem, and I think that may even be the default in 0.4.0.  Try changing
>> that to match the number of partitions dedicated to the indexing topic.
>>
>> You could also change the property directly in the flux file
>> ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from
>> the command line to verify this fixes it.  If you do use this strategy to
>> test, make sure you eventually make the change in Ambari so your changes
>> don't get overridden on a restart.  Changing this setting is confusing,
>> and there have been some recent commits that have addressed that, exposing
>> "topology.acker.executors" directly in Ambari in a dedicated indexing
>> topology section.
>>
>> You might want to also check out the performance tuning guide we did
>> recently:
>> https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md
>> If my guess is wrong and it's not the acker thread setting, the answer is
>> likely in there.
>>
>> Hope this helps.  If you're still stuck send us some more info and we'll
>> try to help you figure it out.
>>
>> Ryan
>>
>> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <bb...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
>>> discussed in May.
>>>
>>> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing
>>> topology stops indexing messages when hitting the 10,000 (10k) message
>>> mark. This is related, as previously found by Christian, to the Kafka
>>> offset strategy, and after further debugging I could track it down to the
>>> number of uncommitted offsets (maxUncommittedOffsets). This is specified
>>> in the Kafka spout, and I could confirm that by providing a higher or
>>> lower value (5k or 15k) the point at which the indexing stops is exactly
>>> that of maxUncommittedOffsets.
>>>
>>> I understand the workaround suggested (changing the strategy from
>>> UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as
>>> I would guess the topology shouldn't need a change to that parameter to
>>> properly ingest data without failing. What seems to happen is that with
>>> LATEST the messages do successfully get committed to Kafka, while with
>>> UNCOMMITTED_EARLIEST, at some point that stops happening.
>>>
>>> When I run the topology with 'LATEST' I usually see messages like this
>>> one on the Kafka Spout (indexing topology):
>>>
>>> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
>>> [{indexing-0=OffsetAndMetadata{offset=2307113,
>>> metadata='{topic-partition=indexing-0
>>>
>>> I do not see such messages on the Kafka Spout when I have the issue and
>>> I'm running UNCOMMITTED_EARLIEST.
>>>
>>> Any suggestion on what may be the real source of the issue here? I did
>>> some tests before and it did not seem to be an issue on 0.3.0. Could this
>>> be something related to the new Kafka spout code in Metron? Or maybe
>>> related to one of the PRs in Metron or Kafka? I saw one in Metron about
>>> duplicate enrichment messages (METRON-569) and a few in Kafka regarding
>>> issues with the committed offset, but most were for newer versions of
>>> Kafka than Metron is using.
>>>
>>> Thanks
>>>
>>
>>
>

Re: Issues with indexing topology

Posted by Guillem Mateos <bb...@gmail.com>.
Hi Ryan,

Thanks for your quick reply. I've been trying to change a few settings
today, from having the executors set to 1 to setting them to a different
number. Also worth mentioning: the system I'm testing this with does not
have a very high message input rate right now, so I wouldn't expect to need
any special tuning. I'm roughly at about 100 messages per minute, which is
really not much.

After trying the executors at a different value I can confirm the issue
still exists. I also see quite a number of messages like this one:

Discarding stale fetch response for partition indexing-0 since its offset
2565827 does not match the expected offset 2565828

Regarding ackers, I was under the impression that acking is something
slightly different from committing: you ack a message and you also commit
it, but it's not exactly the same. Am I right?

Thanks

2017-07-31 19:40 GMT+02:00 Ryan Merriman <me...@gmail.com>:

> Guillem,
>
> I think this ended up being caused by not having enough acker threads to
> keep up.  This is controlled by the "topology.acker.executors" Storm
> property that you will find in the indexing topology flux remote.yaml
> file.  It is exposed in Ambari in the "elasticsearch-properties" property,
> which is itself a list of properties.  Within that there is an
> "indexing.executors" property.  If that is set to 0 it would definitely be
> a problem, and I think that may even be the default in 0.4.0.  Try changing
> that to match the number of partitions dedicated to the indexing topic.
>
> You could also change the property directly in the flux file
> ($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from
> the command line to verify this fixes it.  If you do use this strategy to
> test, make sure you eventually make the change in Ambari so your changes
> don't get overridden on a restart.  Changing this setting is confusing,
> and there have been some recent commits that have addressed that, exposing
> "topology.acker.executors" directly in Ambari in a dedicated indexing
> topology section.
>
> You might want to also check out the performance tuning guide we did
> recently:
> https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md
> If my guess is wrong and it's not the acker thread setting, the answer is
> likely in there.
>
> Hope this helps.  If you're still stuck send us some more info and we'll
> try to help you figure it out.
>
> Ryan
>
> On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <bb...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
>> discussed in May.
>>
>> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing topology
>> stops indexing messages when hitting the 10,000 (10k) message mark. This is
>> related, as previously found by Christian, to the Kafka offset strategy,
>> and after further debugging I could track it down to the number of
>> uncommitted offsets (maxUncommittedOffsets). This is specified in the Kafka
>> spout, and I could confirm that by providing a higher or lower value (5k or
>> 15k) the point at which the indexing stops is exactly that of
>> maxUncommittedOffsets.
>>
>> I understand the workaround suggested (changing the strategy from
>> UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as I
>> would guess the topology shouldn't need a change to that parameter to
>> properly ingest data without failing. What seems to happen is that with
>> LATEST the messages do successfully get committed to Kafka, while with
>> UNCOMMITTED_EARLIEST, at some point that stops happening.
>>
>> When I run the topology with 'LATEST' I usually see messages like this
>> one on the Kafka Spout (indexing topology):
>>
>> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
>> [{indexing-0=OffsetAndMetadata{offset=2307113,
>> metadata='{topic-partition=indexing-0
>>
>> I do not see such messages on the Kafka Spout when I have the issue and
>> I'm running UNCOMMITTED_EARLIEST.
>>
>> Any suggestion on what may be the real source of the issue here? I did
>> some tests before and it did not seem to be an issue on 0.3.0. Could this
>> be something related to the new Kafka spout code in Metron? Or maybe
>> related to one of the PRs in Metron or Kafka? I saw one in Metron about
>> duplicate enrichment messages (METRON-569) and a few in Kafka regarding
>> issues with the committed offset, but most were for newer versions of
>> Kafka than Metron is using.
>>
>> Thanks
>>
>
>

Re: Issues with indexing topology

Posted by Ryan Merriman <me...@gmail.com>.
Guillem,

I think this ended up being caused by not having enough acker threads to
keep up.  This is controlled by the "topology.acker.executors" Storm
property that you will find in the indexing topology flux remote.yaml
file.  It is exposed in Ambari in the "elasticsearch-properties" property,
which is itself a list of properties.  Within that there is an
"indexing.executors" property.  If that is set to 0 it would definitely be
a problem, and I think that may even be the default in 0.4.0.  Try changing
that to match the number of partitions dedicated to the indexing topic.
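
As a sketch of what that maps to: in plain Storm Java code the same knob
is set through Config#setNumAckers, which writes "topology.acker.executors"
into the topology configuration. The partition-count argument here is an
assumption for illustration, not a recommended value:

import org.apache.storm.Config;

public class AckerConfigSketch {
    public static Config build(int indexingPartitions) {
        Config conf = new Config();
        // Writes "topology.acker.executors" into the topology config.
        // Per the advice above, match this to the number of partitions
        // on the indexing topic rather than leaving it at 0.
        conf.setNumAckers(indexingPartitions);
        return conf;
    }
}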

You could also change the property directly in the flux file
($METRON_HOME/flux/indexing/remote.yaml) and restart the topology from the
command line to verify this fixes it.  If you do use this strategy to test,
make sure you eventually make the change in Ambari so your changes don't
get overridden on a restart.  Changing this setting is confusing and there
have been some recent commits that have addressed that, exposing
"topology.acker.executors" directly in Ambari in a dedicated indexing
topology section.

You might want to also check out the performance tuning guide we did
recently:
https://github.com/apache/metron/blob/master/metron-platform/Performance-tuning-guide.md.
If my guess is wrong and it's not the acker thread setting, the answer is
likely in there.

Hope this helps.  If you're still stuck send us some more info and we'll
try to help you figure it out.

Ryan

On Mon, Jul 31, 2017 at 12:02 PM, Guillem Mateos <bb...@gmail.com>
wrote:

> Hi,
>
> I'm facing an issue like the one Christian Tramnitz and Ryan Merriman
> discussed in May.
>
> I have a Metron deployment using 0.4.0 on 10 nodes. The indexing topology
> stops indexing messages when hitting the 10,000 (10k) message mark. This is
> related, as previously found by Christian, to the Kafka offset strategy,
> and after further debugging I could track it down to the number of
> uncommitted offsets (maxUncommittedOffsets). This is specified in the Kafka
> spout, and I could confirm that by providing a higher or lower value (5k or
> 15k) the point at which the indexing stops is exactly that of
> maxUncommittedOffsets.
>
> I understand the workaround suggested (changing the strategy from
> UNCOMMITTED_EARLIEST to LATEST) is really a workaround and not a fix, as I
> would guess the topology shouldn't need a change to that parameter to
> properly ingest data without failing. What seems to happen is that with
> LATEST the messages do successfully get committed to Kafka, while with
> UNCOMMITTED_EARLIEST, at some point that stops happening.
>
> When I run the topology with 'LATEST' I usually see messages like this one
> on the Kafka Spout (indexing topology):
>
> o.a.s.k.s.KafkaSpout [DEBUG] Offsets successfully committed to Kafka
> [{indexing-0=OffsetAndMetadata{offset=2307113, metadata='{topic-partition=
> indexing-0
>
> I do not see such messages on the Kafka Spout when I have the issue and
> I'm running UNCOMMITTED_EARLIEST.
>
> Any suggestion on what may be the real source of the issue here? I did
> some tests before and it did not seem to be an issue on 0.3.0. Could this
> be something related to the new Kafka spout code in Metron? Or maybe
> related to one of the PRs in Metron or Kafka? I saw one in Metron about
> duplicate enrichment messages (METRON-569) and a few in Kafka regarding
> issues with the committed offset, but most were for newer versions of
> Kafka than Metron is using.
>
> Thanks
>