You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mynewt.apache.org by Simon Ratner <si...@proxy.co> on 2016/11/06 21:50:42 UTC

Connect timeout not always accurate

Hi devs,

I am seeing some variance in the handling of `duration_ms` param to
`ble_gap_connect` (on develop). Below are two log traces that demonstrate
it. In both cases, duration_ms=1280; the first shows the timeout event
(status=13) after 535 ticks (4280 ms), while the second timeout is at a
much more respectable 169 ticks (1352 ms).

```
23352163:[ts=2050146956ssb, mod=4 level=1] GAP procedure initiated:
connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
min_ce_len=16 max_ce_len=768
23352693:[ts=2054287572ssb, mod=4 level=1] GAP procedure initiated:
advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
adv_data_len=28
23352698:[ts=2054326632ssb, mod=64 level=2] gatt_cli: connection failed;
status=13
```

```
23355205:[ts=2073912596ssb, mod=4 level=1] GAP procedure initiated:
connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
min_ce_len=16 max_ce_len=768
23355333:[ts=2074912596ssb, mod=4 level=1] GAP procedure initiated:
advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
adv_data_len=28
23355374:[ts=2075232888ssb, mod=64 level=2] gatt_cli: connection failed;
status=13
```

I remember at some point some changes that were supposed to make the
timeouts more reliable - any ideas what might be causing this much variance?

Cheers,
simon

Re: Connect timeout not always accurate

Posted by marko kiiskila <ma...@runtime.io>.
> On Nov 7, 2016, at 8:00 AM, Christopher Collins <cc...@apache.org> wrote:
> 
> Hello all,
> 
> Will found a bug in the host, which is probably (hopefully!) the only
> bug here.  I plan on testing and committing a fix for this bug later
> today.  In the meantime, here is the bug:
> 
>    /*  (ble_hs_heartbeat_sched(), ble_hs.c) */
> 
>    /* Reset heartbeat timer if it is not currently scheduled or if the
>     * specified time is sooner than the current expiration time.
>     */
>    if (!os_callout_queued(&ble_hs_heartbeat_timer.cf_c) ||
>        OS_TIME_TICK_LT(ticks_from_now, ble_hs_heartbeat_timer.cf_c.c_ticks)) {
>        //              ^^^^            ^^^^
> 
>        ble_hs_heartbeat_timer_reset(ticks_from_now);
>    }
> 
> The mistake is marked with a // comment above.  A relative time
> (ticks_from_now) is being compared to an absolute time (cf_c.c_ticks),
> which of course yields nonsense.  I will need to double-check, but I
> think this issue is not caught by the any unit tests because each test
> starts at time 0, eliminating the distinction between absolute and
> relative time.  It would be good to advance the OS time by some amount
> prior to running timer-dependent tests.  I will make sure I can get the
> appropriate tests to fail before I fix the bug.
> 

That is a good idea. I’d prefer starting with random value. Or by
a timestamp that’s slightly smaller than 0. So you’ll some testing of
the wraparound.

> Thanks Will and Simon!
> 
> Chris
> 
> On Sun, Nov 06, 2016 at 08:44:40PM -0800, Christopher Collins wrote:
>> Hi Simon,
> 
> 
>> 
>> On Sun, Nov 06, 2016 at 01:50:42PM -0800, Simon Ratner wrote:
>>> Hi devs,
>>> 
>>> I am seeing some variance in the handling of `duration_ms` param to
>>> `ble_gap_connect` (on develop). Below are two log traces that demonstrate
>>> it. In both cases, duration_ms=1280; the first shows the timeout event
>>> (status=13) after 535 ticks (4280 ms), while the second timeout is at a
>>> much more respectable 169 ticks (1352 ms).
>>> 
>>> ```
>>> 23352163:[ts=2050146956ssb, mod=4 level=1] GAP procedure initiated:
>>> connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
>>> scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
>>> min_ce_len=16 max_ce_len=768
>>> 23352693:[ts=2054287572ssb, mod=4 level=1] GAP procedure initiated:
>>> advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
>>> own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
>>> adv_data_len=28
>>> 23352698:[ts=2054326632ssb, mod=64 level=2] gatt_cli: connection failed;
>>> status=13
>>> ```
>>> 
>>> ```
>>> 23355205:[ts=2073912596ssb, mod=4 level=1] GAP procedure initiated:
>>> connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
>>> scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
>>> min_ce_len=16 max_ce_len=768
>>> 23355333:[ts=2074912596ssb, mod=4 level=1] GAP procedure initiated:
>>> advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
>>> own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
>>> adv_data_len=28
>>> 23355374:[ts=2075232888ssb, mod=64 level=2] gatt_cli: connection failed;
>>> status=13
>>> ```
>> 
>> Hmm, I see what you mean.  I think this is probably a bug in the host,
>> though I don't see any obvious issues in the code.  I can think of a few
>> other possibilities, but they are proably not all that likely.  I want
>> to mention them here just so we can rule them out.
>> 
>> 1. Host event processing code is being starved due to tx or rx of too
>> many BLE data packets.
>> 
>> 2. Host parent task doesn't get a chance to run for an extended period
>> of time because higher priority tasks are continuously busy.
>> 
>> Option 1 concerns data packets only, not events from the controller,
>> such as advertising reports.  I think this could only be the case if
>> your device is sending or receiving notifications at a very high rate
>> (e.g., faster than 10 ms).
>> 
>> Option 2 depends on how tasks in the application are prioritized.  Does
>> your application create a task that could have gotten stuck in a busy
>> loop or similar for 30 seconds?
>> 
>> If neither of these sound likely, then it is probably a bug in Mynewt
>> that I will need to replicate.  I noticed from the log that you are
>> connecting and advertising at the same time, which I will try.  Is there
>> anything else you can say about what your application might have been
>> doing when the problem occurred?
>> 
>> Thanks,
>> Chris


Re: Connect timeout not always accurate

Posted by Christopher Collins <cc...@apache.org>.
Hello all,

Will found a bug in the host, which is probably (hopefully!) the only
bug here.  I plan on testing and committing a fix for this bug later
today.  In the meantime, here is the bug:

    /*  (ble_hs_heartbeat_sched(), ble_hs.c) */

    /* Reset heartbeat timer if it is not currently scheduled or if the
     * specified time is sooner than the current expiration time.
     */
    if (!os_callout_queued(&ble_hs_heartbeat_timer.cf_c) ||
        OS_TIME_TICK_LT(ticks_from_now, ble_hs_heartbeat_timer.cf_c.c_ticks)) {
        //              ^^^^            ^^^^

        ble_hs_heartbeat_timer_reset(ticks_from_now);
    }

The mistake is marked with a // comment above.  A relative time
(ticks_from_now) is being compared to an absolute time (cf_c.c_ticks),
which of course yields nonsense.  I will need to double-check, but I
think this issue is not caught by the any unit tests because each test
starts at time 0, eliminating the distinction between absolute and
relative time.  It would be good to advance the OS time by some amount
prior to running timer-dependent tests.  I will make sure I can get the
appropriate tests to fail before I fix the bug.

Thanks Will and Simon!

Chris

On Sun, Nov 06, 2016 at 08:44:40PM -0800, Christopher Collins wrote:
> Hi Simon,


> 
> On Sun, Nov 06, 2016 at 01:50:42PM -0800, Simon Ratner wrote:
> > Hi devs,
> > 
> > I am seeing some variance in the handling of `duration_ms` param to
> > `ble_gap_connect` (on develop). Below are two log traces that demonstrate
> > it. In both cases, duration_ms=1280; the first shows the timeout event
> > (status=13) after 535 ticks (4280 ms), while the second timeout is at a
> > much more respectable 169 ticks (1352 ms).
> > 
> > ```
> > 23352163:[ts=2050146956ssb, mod=4 level=1] GAP procedure initiated:
> > connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
> > scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
> > min_ce_len=16 max_ce_len=768
> > 23352693:[ts=2054287572ssb, mod=4 level=1] GAP procedure initiated:
> > advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
> > own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
> > adv_data_len=28
> > 23352698:[ts=2054326632ssb, mod=64 level=2] gatt_cli: connection failed;
> > status=13
> > ```
> > 
> > ```
> > 23355205:[ts=2073912596ssb, mod=4 level=1] GAP procedure initiated:
> > connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
> > scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
> > min_ce_len=16 max_ce_len=768
> > 23355333:[ts=2074912596ssb, mod=4 level=1] GAP procedure initiated:
> > advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
> > own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
> > adv_data_len=28
> > 23355374:[ts=2075232888ssb, mod=64 level=2] gatt_cli: connection failed;
> > status=13
> > ```
> 
> Hmm, I see what you mean.  I think this is probably a bug in the host,
> though I don't see any obvious issues in the code.  I can think of a few
> other possibilities, but they are proably not all that likely.  I want
> to mention them here just so we can rule them out.
> 
> 1. Host event processing code is being starved due to tx or rx of too
> many BLE data packets.
> 
> 2. Host parent task doesn't get a chance to run for an extended period
> of time because higher priority tasks are continuously busy.
> 
> Option 1 concerns data packets only, not events from the controller,
> such as advertising reports.  I think this could only be the case if
> your device is sending or receiving notifications at a very high rate
> (e.g., faster than 10 ms).
> 
> Option 2 depends on how tasks in the application are prioritized.  Does
> your application create a task that could have gotten stuck in a busy
> loop or similar for 30 seconds?
> 
> If neither of these sound likely, then it is probably a bug in Mynewt
> that I will need to replicate.  I noticed from the log that you are
> connecting and advertising at the same time, which I will try.  Is there
> anything else you can say about what your application might have been
> doing when the problem occurred?
> 
> Thanks,
> Chris

Re: Connect timeout not always accurate

Posted by Christopher Collins <cc...@apache.org>.
Hi Simon,

On Sun, Nov 06, 2016 at 01:50:42PM -0800, Simon Ratner wrote:
> Hi devs,
> 
> I am seeing some variance in the handling of `duration_ms` param to
> `ble_gap_connect` (on develop). Below are two log traces that demonstrate
> it. In both cases, duration_ms=1280; the first shows the timeout event
> (status=13) after 535 ticks (4280 ms), while the second timeout is at a
> much more respectable 169 ticks (1352 ms).
> 
> ```
> 23352163:[ts=2050146956ssb, mod=4 level=1] GAP procedure initiated:
> connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
> scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
> min_ce_len=16 max_ce_len=768
> 23352693:[ts=2054287572ssb, mod=4 level=1] GAP procedure initiated:
> advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
> own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
> adv_data_len=28
> 23352698:[ts=2054326632ssb, mod=64 level=2] gatt_cli: connection failed;
> status=13
> ```
> 
> ```
> 23355205:[ts=2073912596ssb, mod=4 level=1] GAP procedure initiated:
> connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
> scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
> min_ce_len=16 max_ce_len=768
> 23355333:[ts=2074912596ssb, mod=4 level=1] GAP procedure initiated:
> advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
> own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
> adv_data_len=28
> 23355374:[ts=2075232888ssb, mod=64 level=2] gatt_cli: connection failed;
> status=13
> ```

Hmm, I see what you mean.  I think this is probably a bug in the host,
though I don't see any obvious issues in the code.  I can think of a few
other possibilities, but they are proably not all that likely.  I want
to mention them here just so we can rule them out.

1. Host event processing code is being starved due to tx or rx of too
many BLE data packets.

2. Host parent task doesn't get a chance to run for an extended period
of time because higher priority tasks are continuously busy.

Option 1 concerns data packets only, not events from the controller,
such as advertising reports.  I think this could only be the case if
your device is sending or receiving notifications at a very high rate
(e.g., faster than 10 ms).

Option 2 depends on how tasks in the application are prioritized.  Does
your application create a task that could have gotten stuck in a busy
loop or similar for 30 seconds?

If neither of these sound likely, then it is probably a bug in Mynewt
that I will need to replicate.  I noticed from the log that you are
connecting and advertising at the same time, which I will try.  Is there
anything else you can say about what your application might have been
doing when the problem occurred?

Thanks,
Chris