Posted to dev@crail.apache.org by Sumit Sen <su...@gmail.com> on 2018/06/14 17:58:35 UTC

Crail and data locality question

I am looking into using Crail to store and access data on my compute
cluster that has nodes connected with InfiniBand.  I am trying to
understand how to make use of data locality in Crail and would appreciate
everyone's suggestions.  I'll describe my existing use case to illustrate
what I am trying to achieve.

Today I populate data into a RAM disk on each of 8 nodes, which are mapped
into a single namespace using NFS.  I then have a Spark job that is
partitioned and locality-aware, so that it can execute an algorithm on this
data locally.  The Spark RDD actually contains only the paths of the data
files, not the data itself.  I group files that are stored on the same node
into one partition and map the path to the locality preference of the RDD
partition.  (I follow this odd approach because the algorithm is in C++ and
reads the data directly.)
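The grouping step above can be sketched as follows.  The paths and host
names are made up for illustration; in a real job each resulting
(host, paths) pair would become one RDD partition whose preferred location
is that host, e.g. via Scala's SparkContext.makeRDD overload that accepts
explicit location preferences per element.

```python
from collections import defaultdict

def group_by_host(file_hosts):
    """Group file paths by the host that stores them, so that each
    host becomes one locality-aware partition."""
    partitions = defaultdict(list)
    for path, host in file_hosts.items():
        partitions[host].append(path)
    # Sort paths within each partition for deterministic output.
    return {host: sorted(paths) for host, paths in partitions.items()}

# Hypothetical mapping of data files to the node holding them.
file_hosts = {
    "/data/part-0001": "node1",
    "/data/part-0002": "node2",
    "/data/part-0003": "node1",
}

partitions = group_by_host(file_hosts)
# -> {"node1": ["/data/part-0001", "/data/part-0003"],
#     "node2": ["/data/part-0002"]}
```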

I would like to replace the NFS solution with Crail which seems more
flexible and configurable (I want to support hybrid ethernet/infiniband
clusters).  What I don't understand yet is:
- if I write a file to Crail, what is its locality?
- if I read a file from Crail, how can I know where it is stored and use
this information to feed Spark's locality preference? I.e. how would I
construct an RDD similar to the one I described above?
- is there a better way to use Crail, Spark and C++ together?  (I am trying
to avoid sending the data to C++ via the RDD.pipe() method).

Thanks,
Sumit

Re: Crail and data locality question

Posted by Patrick Stuedi <ps...@gmail.com>.
Hey Sumit,

Data locality in Crail simply means that there is a locality API,
getLocations(file, offset, length), that applications can call to find out
where a particular range of data is physically located, i.e. where the
corresponding blocks are.  Crail has such an API, just as HDFS does.  The
locality API is typically used by the Spark scheduler (or the Hadoop
scheduler, etc.) when scheduling tasks: most of these schedulers try to make
sure tasks are executed on the machines that hold the corresponding data.
With Crail this happens naturally if you put the data into Crail and use
Crail as an input source through the HDFS adaptor.
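To make the scheduler's use of such an API concrete, here is a minimal
sketch of turning block locations into a locality preference.  It assumes
the locations come back as (offset, length, hosts) tuples, the general shape
returned by HDFS-style getFileBlockLocations calls; the function name and
the sample data are made up for illustration.

```python
from collections import Counter

def preferred_hosts(block_locations, top_n=1):
    """Rank hosts by how many bytes of the file they hold locally,
    given block locations as (offset, length, [hosts]) tuples."""
    bytes_per_host = Counter()
    for offset, length, hosts in block_locations:
        for host in hosts:
            bytes_per_host[host] += length
    return [host for host, _ in bytes_per_host.most_common(top_n)]

# Hypothetical locations for a 3-block file (64 MB block size).
MB = 1024 * 1024
locations = [
    (0,        64 * MB, ["node1"]),
    (64 * MB,  64 * MB, ["node2"]),
    (128 * MB, 64 * MB, ["node1"]),
]

# node1 holds 128 MB of the file, node2 only 64 MB.
print(preferred_hosts(locations))  # -> ['node1']
```

A scheduler effectively does the same aggregation when it decides where to
run the task that reads this file.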

There is also an fsck tool that lets you query the physical locations of a
file's blocks from the shell; try ./bin/crail fsck.

Let me know if that helps.

-Patrick


