Posted to issues@mesos.apache.org by "Till Toenshoff (JIRA)" <ji...@apache.org> on 2015/05/13 21:55:00 UTC

[jira] [Commented] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky

    [ https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542575#comment-14542575 ] 

Till Toenshoff commented on MESOS-1303:
---------------------------------------

This is not a descriptor leak but a failure to move a file into a nonexistent folder. The folder name is extracted from the path via stout's {{os::dirname()}}, which relies on the standard {{::dirname()}}.

The problem is that the OSX implementation of {{::dirname()}} is not thread-safe (which POSIX permits!). As a result, concurrent invocations of the function can return each other's results.

To fix the problem, we could add locking around invocations of those functions. Alternatively, we could replace them with thread-safe variants, like https://android.googlesource.com/platform/bionic/+/ics-mr0/libc/bionic/dirname_r.c .
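A minimal sketch of the locking option, assuming C++11 is available. The names {{lockedDirname}} and {{countMismatches}} are hypothetical and not part of stout; the idea is only that the result is copied out of {{::dirname()}} while the lock is still held:

```cpp
#include <libgen.h>  // ::dirname

#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper (not stout's os::dirname): serializes all calls to
// ::dirname and copies the result before the lock is released.
std::string lockedDirname(const std::string& path)
{
  static std::mutex mutex;

  // ::dirname may modify its argument, so work on a private,
  // NUL-terminated copy of the path.
  std::vector<char> buffer(path.begin(), path.end());
  buffer.push_back('\0');

  std::lock_guard<std::mutex> lock(mutex);

  const char* result = ::dirname(buffer.data());

  // Copy while the lock is held: the result may point into storage that
  // is shared between threads, or into 'buffer'.
  return result != nullptr ? std::string(result) : std::string();
}

// Hypothetical self-check: each thread resolves its own path 'iterations'
// times; returns the number of results that did not match.
int countMismatches(int threads, int iterations)
{
  assert(threads > 0 && iterations > 0);

  std::vector<int> mismatches(threads, 0);
  std::vector<std::thread> pool;

  for (int i = 0; i < threads; i++) {
    pool.emplace_back([i, iterations, &mismatches]() {
      const std::string expected = "/test_path_" + std::to_string(i);
      const std::string path = expected + "/file";

      for (int n = 0; n < iterations; n++) {
        if (lockedDirname(path) != expected) {
          mismatches[i]++;  // Each thread only touches its own slot.
        }
      }
    });
  }

  for (std::thread& t : pool) {
    t.join();
  }

  int total = 0;
  for (int m : mismatches) {
    total += m;
  }
  return total;
}
```

Note that the lock only helps if every caller goes through the wrapper; any code that still calls {{::dirname()}} directly can race with it.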

Here is the source of the example test:
{noformat}
#include <errno.h>
#include <libgen.h>
#include <limits.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

const int THREAD_COUNT = 2;
int threadExitValues[THREAD_COUNT];

void* handler (void* val)
{
  int index = *(int *)val;
  int* ret = &threadExitValues[index];

  // Render a thread-specific path.
  char path[PATH_MAX];
  sprintf(path, "/test_path_%d/file", index);

  // dirname() may modify its argument, so it operates on a copy of 'path'.
  char temp[PATH_MAX];

  char* result = dirname(strcpy(temp, path));

  if (result == NULL) {
    printf("dirname failed: %s\n", strerror(errno));
    *ret = 1;
  } else {
    char output[PATH_MAX];

    strcpy(output, result);

    // The directory name should be contained within 'path' as long as 'path'
    // contains a directory name.
    // NOTE: Concurrent invocations of dirname may render results that fail this
    // test.
    *ret = strstr(path, output) != NULL ? 0 : 1;

    printf("thread %d: dirname(\"%s\") returns \"%s\"\n", index,
                                                          path,
                                                          output);
  }

  pthread_exit(ret);
}


int main(int argc, char *argv[])
{
  int counter = 1;

  // Keep on testing until we get a failure...
  while(true) {
    int index[THREAD_COUNT];
    int *ret;
    pthread_t thread[THREAD_COUNT];

    printf("test iteration %d\n", counter);

    for (int j=0;j < THREAD_COUNT;j++)
    {
      index[j] = j;
      pthread_create(&thread[j], NULL, handler, (void *) &index[j]);
    }

    for (int j=0;j < THREAD_COUNT;j++)
    {
      pthread_join(thread[j], (void**)&ret);

      if (*ret != 0) {
        printf("test failed\n");
        exit(1);
      }
    }

    printf("test succeeded\n\n");

    counter++;
  }
}
{noformat}

Here is the output of the example test on OSX 10.10.4:
{noformat}
[...]

test iteration 398
thread 0: dirname("/test_path_0/file") returns "/test_path_0"
thread 1: dirname("/test_path_1/file") returns "/test_path_1"
test succeeded

test iteration 399
thread 0: dirname("/test_path_0/file") returns "/test_path_0"
thread 1: dirname("/test_path_1/file") returns "/test_path_1"
test succeeded

test iteration 400
thread 1: dirname("/test_path_1/file") returns "/test_path_1"
thread 0: dirname("/test_path_0/file") returns "/test_path_1"
test failed
{noformat}


> ExamplesTest.{TestFramework, NoExecutorFramework} flaky
> -------------------------------------------------------
>
>                 Key: MESOS-1303
>                 URL: https://issues.apache.org/jira/browse/MESOS-1303
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>            Reporter: Ian Downes
>              Labels: flaky
>
> I'm having trouble reproducing this but I did observe it once on my OSX system:
> {noformat}
> [==========] Running 2 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 2 tests from ExamplesTest
> [ RUN      ] ExamplesTest.TestFramework
> ../../src/tests/script.cpp:81: Failure
> Failed
> test_framework_test.sh terminated with signal 'Abort trap: 6'
> [  FAILED  ] ExamplesTest.TestFramework (953 ms)
> [ RUN      ] ExamplesTest.NoExecutorFramework
> [       OK ] ExamplesTest.NoExecutorFramework (10162 ms)
> [----------] 2 tests from ExamplesTest (11115 ms total)
> [----------] Global test environment tear-down
> [==========] 2 tests from 1 test case ran. (11121 ms total)
> [  PASSED  ] 1 test.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] ExamplesTest.TestFramework
> {noformat}
> when investigating a failed make check for https://reviews.apache.org/r/20971/
> {noformat}
> [----------] 6 tests from ExamplesTest
> [ RUN      ] ExamplesTest.TestFramework
> [       OK ] ExamplesTest.TestFramework (8643 ms)
> [ RUN      ] ExamplesTest.NoExecutorFramework
> tests/script.cpp:81: Failure
> Failed
> no_executor_framework_test.sh terminated with signal 'Aborted'
> [  FAILED  ] ExamplesTest.NoExecutorFramework (7220 ms)
> [ RUN      ] ExamplesTest.JavaFramework
> [       OK ] ExamplesTest.JavaFramework (11181 ms)
> [ RUN      ] ExamplesTest.JavaException
> [       OK ] ExamplesTest.JavaException (5624 ms)
> [ RUN      ] ExamplesTest.JavaLog
> [       OK ] ExamplesTest.JavaLog (6472 ms)
> [ RUN      ] ExamplesTest.PythonFramework
> [       OK ] ExamplesTest.PythonFramework (14467 ms)
> [----------] 6 tests from ExamplesTest (53607 ms total)
> {noformat}


