Posted to dev@crail.apache.org by Animesh Trivedi <an...@gmail.com> on 2018/06/09 08:29:59 UTC

Re: Crail iobench -- help needed

Hi Sumit,

Great that you attended the talk. Please also join the Crail mailing list (
crail@crail.apache.org, cc'ed) and post issues there so that others can
benefit from them. As you might have figured out, we are a new project,
so we are still learning the ropes :)

Having said that:

1) The RDMA tier failure looks like either (i) the InfiniBand device is not
set up properly (what does ibv_devices show?), and/or (ii) you do not have
permission to register large memory segments (check with ulimit -l). I
think the default is 64 kB. If that is so, then you have to increase the
memory limit (https://access.redhat.com/solutions/61334, memlock). For the
RDMA tier, Crail needs to register memory that is typically more than just
a few kB.

2) The TCP tier error is more cryptic. Maybe other developers have an idea
of what might be wrong. Could you also please post your Crail
configuration?
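
For point 1, here is a rough way to check the locked-memory limit from a
script rather than by eye. It is only a sketch: the 1 GiB figure is a
made-up placeholder (not a Crail default), and the helper names
memlock_allows and describe are hypothetical. Note that getrlimit reports
bytes, whereas ulimit -l prints kB. Run it as the same user, and in the
same kind of session, that starts the datanode.

```python
import resource


def memlock_allows(required_bytes, limits=None):
    """Return True if the soft RLIMIT_MEMLOCK permits locking required_bytes.

    limits is an optional (soft, hard) pair in bytes; when omitted, the
    limits of the current process are queried.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY or soft >= required_bytes


def describe(limit):
    """Format a limit in bytes the way ulimit -l would show it (in kB)."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} kB"


if __name__ == "__main__":
    # Hypothetical amount of memory the RDMA tier would register;
    # substitute the size configured for your own storage tier.
    required = 1 << 30

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print(f"memlock soft limit: {describe(soft)}, hard limit: {describe(hard)}")
    if memlock_allows(required, (soft, hard)):
        print("limit looks large enough for the assumed registration size")
    else:
        print("limit is too small; raise memlock as described in the link above")
```

If the soft limit comes back as 64 kB, that would match the "Memory
registration failed with -1" error in your trace.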

Cheers,
--
Animesh


On Sat, Jun 9, 2018 at 1:00 AM, Sumit Sen <su...@gmail.com> wrote:

> Hi Animesh,
>
> I've just started trying to use Crail on a cluster running SLES12. I
> attended the talk at Spark Summit which mentioned crail.  Our nodes are
> connected with both ethernet and infiniband.  I want to run some of the
> benchmarks to see what sort of performance I can get.  However I am running
> into problems and haven't been able to figure out what to do.  Can you help
> me or give me the name of someone else who can help?  I've given some
> details below. I'd appreciate any help I can get to come up to speed on
> this.
>
> Thanks,
> Sumit
>
> Here are the issues I'm facing:
> *RDMA configuration:*
> Unable to start data node:
> Exception in thread "main" java.io.IOException: Memory registration failed
> with -1
>         at com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(
> NatRegMrCall.java:80)
>         at com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(
> NatRegMrCall.java:33)
>         at org.apache.crail.storage.rdma.RdmaStorageServer.
> allocateResource(RdmaStorageServer.java:120)
>         at org.apache.crail.storage.StorageServer.main(
> StorageServer.java:152)
>
> *TCP configuration:*
> - both namenode and datanode start up
> However, I can't run "iobench -t write". I get an immediate error that
> crashes the jvm on the datanode
> I see the following stack on the iobench console:
> warmUp, warmupFile /tmp.dat2001725267, operations 32
> Exception in thread "main" java.util.concurrent.ExecutionException:
> java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException:
> java.io.IOException: Connection reset by peer
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>         at org.apache.crail.tools.CrailBenchmark.warmUp(
> CrailBenchmark.java:978)
>         at org.apache.crail.tools.CrailBenchmark.write(
> CrailBenchmark.java:97)
>         at org.apache.crail.tools.CrailBenchmark.main(
> CrailBenchmark.java:1070)
> Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException:
> java.io.IOException: Connection reset by peer
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>         ... 3 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
> Connection reset by peer
>         at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:73)
>         at org.apache.crail.storage.tcp.TcpStorageFuture.get(
> TcpStorageFuture.java:56)
>         at org.apache.crail.storage.tcp.TcpStorageFuture.get(
> TcpStorageFuture.java:30)
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>         ... 4 more
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at com.ibm.narpc.NaRPCChannel.fetchBuffer(NaRPCChannel.java:51)
>         at com.ibm.narpc.NaRPCEndpoint.pollResponse(NaRPCEndpoint.java:74)
>         at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:70)
>         ... 7 more
>
>

Re: Crail iobench -- help needed

Posted by Sumit Sen <su...@gmail.com>.
Thanks for the help. I've been able to get the RDMA setup working and am
troubleshooting a few issues with the bench tests. The issues so far have
all been configuration related: the ulimit -l setting and an incorrect
value for "crail.namenode.rpctype".
I am ignoring the TCP tier for now since I don't really need it yet.

I have more questions about data locality and Spark which I'll ask in
another post.

Thanks for all your help,

Sumit
