Posted to dev@crail.apache.org by Jeongyoon Eo <je...@gmail.com> on 2019/09/24 13:39:48 UTC

Local RDMA data path missing error in crail-client

Hi,

I'm trying to run Spark with Apache Crail on two Mellanox-connected,
RDMA-capable machines, and I keep getting this error:

19/09/24 13:18:41 INFO ibm.disni: createEventChannel, objId 139745971170192
19/09/24 13:18:41 INFO ibm.disni: passive endpoint group, maxWR 32, maxSge 4, cqSize 64
19/09/24 13:18:41 INFO ibm.disni: launching cm processor, cmChannel 0
19/09/24 13:18:41 INFO apache.crail: new local endpoint for address /172.30.100.108:9061
19/09/24 13:18:41 INFO apache.crail: new local dataPath /dev/hugepages/data/172.30.100.108-9061
19/09/24 13:18:41 INFO apache.crail: ERROR: failed data operation
19/09/24 13:18:41 INFO apache.crail: new local endpoint for address /172.30.100.108:9061
java.io.IOException: java.lang.Exception: Local RDMA data path missing
        at org.apache.crail.storage.rdma.client.RdmaStoragePassiveGroup.createEndpoint(RdmaStoragePassiveGroup.java:53)
        at org.apache.crail.storage.rdma.RdmaStorageClient.createEndpoint(RdmaStorageClient.java:84)
        at org.apache.crail.utils.EndpointCache$StorageEndpointCache.getDataEndpoint(EndpointCache.java:130)
        at org.apache.crail.utils.EndpointCache.getDataEndpoint(EndpointCache.java:69)
        at org.apache.crail.core.CoreStream.prepareAndTrigger(CoreStream.java:230)
        at org.apache.crail.core.CoreStream.dataOperation(CoreStream.java:100)

I've checked the read/write permissions on /dev/hugepages/data/ and they
look fine.
Has anyone else experienced a similar issue?

Best regards,
Jeongyoon.

*----------*
*Jeongyoon Eo*
Software Platform Lab
Department of Computer Science and Engineering
Seoul National University
Email: jeongyoon0807@gmail.com <je...@snu.ac.kr>

Re: Local RDMA data path missing error in crail-client

Posted by Patrick Stuedi <ps...@gmail.com>.
It looks to me like you have localmap enabled (it is actually true by
default). This is an optimization where mmap is used to access local blocks
(blocks served by a local datanode). Somehow the local endpoint can't find
the directory where the data is, which shouldn't happen. In any case, you
can try turning off localmap by adding the following to your crail-site.conf:

crail.storage.rdma.localmap    false

If that does not fix it, please send the full printout of the config
parameters at the client.
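
To rule out a missing directory on the client machine, a quick check along
these lines may help. It is only a sketch: the <data root>/<ip>-<port>
layout is inferred from the "new local dataPath" line in the log above, the
function name check_local_data_path is hypothetical, and this is not
necessarily the check Crail itself performs.

```python
import os


def check_local_data_path(data_root, ip, port):
    """Return a list of problems with <data_root>/<ip>-<port> (empty if none).

    The directory layout is an assumption taken from the log line
    'new local dataPath /dev/hugepages/data/172.30.100.108-9061'.
    """
    path = os.path.join(data_root, f"{ip}-{port}")
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.R_OK | os.X_OK):
        # Directory exists but the current user cannot list or enter it.
        problems.append(f"{path} is not readable by the current user")
    return problems


if __name__ == "__main__":
    # Values copied from the log in the original report.
    for problem in check_local_data_path("/dev/hugepages/data", "172.30.100.108", 9061):
        print(problem)
```

Run it as the same user that runs the Spark executor, on the machine where
the error appears. If it prints nothing, the directory is at least visible
to that user.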

-Patrick
