Posted to issues@mesos.apache.org by "Till Toenshoff (JIRA)" <ji...@apache.org> on 2018/02/05 19:51:00 UTC

[jira] [Commented] (MESOS-8546) PythonFramework test fails with cache write failure.

    [ https://issues.apache.org/jira/browse/MESOS-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352833#comment-16352833 ] 

Till Toenshoff commented on MESOS-8546:
---------------------------------------

* Why does the cache fail to write?
 ** Because Python's egg caching does not cope well with parallel processes trying to cache the very same egg.
* Why did we never see this before?
 ** We used to bundle both the scheduler and the executor egg into one module. Once the scheduler was loaded, the caching was already done. Now that the executor has its own egg, the problem shows up because the framework launches multiple tasks, which in turn launch executors that each try to cache their driver egg.
* How can we solve this?
 ** Individual tasks should not mutate host machine state. Giving each task its own {{PYTHON_EGG_CACHE}} within the {{MESOS_SANDBOX}} therefore seems to be a proper solution: it prevents concurrency problems on that end and makes sure the state gets properly GCed.
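
A minimal sketch of that idea, not the actual patch: the helper name {{configure_egg_cache}}, the {{.python-eggs}} subdirectory name and the temp-dir fallback are assumptions made for illustration. It only relies on the {{MESOS_SANDBOX}} and {{PYTHON_EGG_CACHE}} environment variables mentioned above.

```python
import errno
import os
import tempfile


def configure_egg_cache(environ=None):
    """Point PYTHON_EGG_CACHE at a per-task directory (hypothetical helper)."""
    if environ is None:
        environ = os.environ

    # Assumption: MESOS_SANDBOX is set for the task. If it is not,
    # fall back to a fresh private temp dir so the cache stays unshared.
    sandbox = environ.get("MESOS_SANDBOX") or tempfile.mkdtemp(prefix="egg-cache-")

    # Assumption: the subdirectory name is arbitrary; ".python-eggs" just
    # mirrors the default cache name seen in the traceback.
    cache_dir = os.path.join(sandbox, ".python-eggs")
    try:
        os.makedirs(cache_dir)
    except OSError as err:
        # An already existing directory is fine; anything else is an error.
        if err.errno != errno.EEXIST:
            raise

    environ["PYTHON_EGG_CACHE"] = cache_dir
    return cache_dir
```

For this to have any effect, the variable presumably has to be set before the first {{from mesos.executor import MesosExecutorDriver}}, either by calling such a helper at the top of the executor script or by exporting the variable in the environment the executor is launched with.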

> PythonFramework test fails with cache write failure.
> ----------------------------------------------------
>
>                 Key: MESOS-8546
>                 URL: https://issues.apache.org/jira/browse/MESOS-8546
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.6.0
>            Reporter: Till Toenshoff
>            Assignee: Till Toenshoff
>            Priority: Major
>              Labels: flaky, flaky-test, mesosphere
>
> After some recent changes, the {{ExamplesTest.PythonFramework}} test fails on CentOS and Ubuntu rather frequently (but not always).
> The symptom is always like this (taken from an ASF CI run):
> {noformat}
> [...]
> I0203 03:21:06.871362 11001 leveldb.cpp:347] Persisting action (16 bytes) to leveldb took 73.84466ms
> I0203 03:21:06.871433 11001 replica.cpp:712] Persisted action TRUNCATE at position 8
> I0203 03:21:06.871841 10984 replica.cpp:695] Replica received learned notice for position 8 from log-network(1)@172.17.0.4:43102
> I0203 03:21:06.908581 11004 hierarchical.cpp:2429] Filtered offer with ports:[31000-32000]; mem:9984; disk:367463 on agent 0bd8b628-491d-46a1-a358-6cc902ee2578-S1 for role * of framework 0bd8b628-491d-46a1-a358-6cc902ee2578-0000
> I0203 03:21:06.908924 11004 hierarchical.cpp:2429] Filtered offer with cpus:1; mem:10112; disk:367463; ports:[31000-32000] on agent 0bd8b628-491d-46a1-a358-6cc902ee2578-S2 for role * of framework 0bd8b628-491d-46a1-a358-6cc902ee2578-0000
> I0203 03:21:06.909207 11004 hierarchical.cpp:2429] Filtered offer with ports:[31000-32000]; mem:9984; disk:367463 on agent 0bd8b628-491d-46a1-a358-6cc902ee2578-S0 for role * of framework 0bd8b628-491d-46a1-a358-6cc902ee2578-0000
> I0203 03:21:06.909306 11004 hierarchical.cpp:1517] Performed allocation for 3 agents in 1.276217ms
> I0203 03:21:06.945303 10984 leveldb.cpp:347] Persisting action (18 bytes) to leveldb took 73.445285ms
> I0203 03:21:06.945451 10984 leveldb.cpp:423] Deleting ~2 keys from leveldb took 81868ns
> I0203 03:21:06.945477 10984 replica.cpp:712] Persisted action TRUNCATE at position 8
> Traceback (most recent call last):
> File "/mesos/mesos-1.6.0/_build/../src/examples/python/test_executor.py", line 25, in <module>
> from mesos.executor import MesosExecutorDriver
> File "build/bdist.linux-x86_64/egg/mesos/executor/__init__.py", line 17, in <module>
> File "build/bdist.linux-x86_64/egg/mesos/executor/_executor.py", line 7, in <module>
> File "build/bdist.linux-x86_64/egg/mesos/executor/_executor.py", line 4, in __bootstrap__
> File "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py", line 1172, in resource_filename
> self, resource_name
> File "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py", line 1716, in get_resource_filename
> self._extract_resource(manager, self._eager_to_zip(name))
> File "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py", line 1746, in _extract_resource
> self.egg_name, self._parts(zip_path)
> File "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py", line 1239, in get_cache_path
> self.extraction_error()
> File "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py", line 1219, in extraction_error
> raise err
> pkg_resources.ExtractionError: Can't extract file(s) to egg cache
> The following error occurred while trying to extract file(s) to the Python egg
> cache:
> [Errno 17] File exists: '/home/mesos/.python-eggs/mesos.executor-1.6.0-py2.7-linux-x86_64.egg-tmp'
> The Python egg cache directory is currently set to:
> /home/mesos/.python-eggs
> Perhaps your account does not have write access to this directory? You can
> change the cache directory by setting the PYTHON_EGG_CACHE environment
> variable to point to an accessible directory.{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)