Posted to dev@gora.apache.org by Julien Nioche <li...@gmail.com> on 2013/10/02 15:22:13 UTC

Data locality

Hi guys,

I can't quite remember whether Gora takes data locality into account when
generating the input for a MapReduce job. Could someone explain how it is
currently handled and, if things differ from one backend to another, how?

Thanks

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Data locality

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Julien,

My comments are inline as well.

2013/10/3 Julien Nioche <li...@gmail.com>

> Hi Renato,
>
> Comments below (in orange)
>
> On 3 October 2013 15:46, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
> > Hi Julien,
> >
> > My comments are inline.
> >
> > 2013/10/3 Julien Nioche <li...@gmail.com>
> >
> > > Hi Renato
> > >
> > > Thanks for your comments
> > >
> > > > Gora doesn't take into account data locality, as this is taken care
> > > > of by each different data store.
> > >
> > >
> > > The datastores don't know anything about MapReduce nodes.
> >
> >
> > True.
> >
> >
> > > Imagine you have a Hadoop cluster with 3 nodes and that each node
> > > also runs a datastore like HBase. If Gora does not enforce data
> > > locality (as it does with HDFS content), then Gora would build the
> > > MapReduce inputs using data from other nodes, meaning more network
> > > traffic. Am I missing something here?
> > >
> >
> > Gora still doesn't know anything about data locality itself; it relies
> > on how the InputFormat is created for each data store, so it is each
> > data store's InputFormat that is in charge of passing this information
> > to Gora. Gora acts as a connector to different data stores.
> >
> > In the case of HBase, the input splits created come directly from
> > HBase, so they tell the MapReduce job how to connect to HBase (being
> > aware of data locality).
>
>
> To reformulate: the GoraInputFormat builds the splits (which hold the
> information about where the data is located) based on what the method
> getPartitions(Query q) returns for each datastore. A datastore may or
> may not provide information about where the data is located. HBase does,
> Cassandra doesn't (see below). Correct?
>

Correct, Julien. Each data store implementation has its own advantages and
disadvantages, as some data stores have more contributors involved than
others, so there might be more development in a specific data store.


> > Now let's say we want to use Gora to connect to a standalone SQL
> > database. Then the number of splits should probably be determined by
> > getting the total number of records to be queried, and then creating a
> > number of Mappers to retrieve that total number of records in parallel.
> > Data locality would be passed to Gora through the InputFormat for that
> > SQL database.
> >
>
> Data locality != data location. The gora-SQL module will indeed pass on
> the information about where the data is located (data location). However,
> whether there will be data locality (i.e. code running where the data is
> stored) depends on the getPartitions method. In the case of a standalone
> SQL DB there won't, by definition, be any data locality.
>

Well, if your SQL DB is running on the same node as some of your MapReduce
tasks, then there would be "data locality". But of course, as you said,
this might not be the case most of the time.
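The distinction being drawn here can be made concrete: data location is a property of the split, while data locality only exists if the task actually lands on one of those hosts. A minimal sketch (illustrative names, not Gora or Hadoop API):

```java
import java.util.Arrays;

public class LocalitySketch {
    // "Data location": the set of hosts that hold a split's data.
    // "Data locality": achieved only when the task runs on one of them.
    static boolean isLocal(String taskHost, String[] splitLocations) {
        return Arrays.asList(splitLocations).contains(taskHost);
    }

    public static void main(String[] args) {
        String[] locations = {"node1", "node2"}; // where the data lives
        System.out.println(isLocal("node1", locations)); // true  -> local read
        System.out.println(isLocal("node3", locations)); // false -> network read
    }
}
```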


> So, in a way Gora "is aware" through the InputFormats implemented
> > specifically for each data store, but Gora does not implement this
> feature
> > by itself, it relies on data stores' input formats.
> >
>
> or rather how getPartitions is implemented?
>

The getPartitions method uses each data store's InputFormat ;)


> > > > The one thing that Gora "takes" into account is the number of
> > > > partitions it should use, and that number of partitions is used to
> > > > run more/fewer map tasks. This partition number hasn't been
> > > > implemented properly by all data stores, and AFAIK most of them
> > > > return a single partition, which means we only use a single map task
> > > > to read as much data as we have to.
> > > >
> > >
> > > That does not match what we found in
> > > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
> > > as the data from HBase and Cassandra was represented as several
> > > partitions. See:
> > >
> > > *The distribution of Mappers and Reducers for each task also stays
> > > constant over all iterations with N2C, while with N2Hbase data seems
> > > to be partitioned differently in each iteration and more Mappers are
> > > required as the crawl goes on. This results in a longer processing
> > > time as our Hadoop setup allows only up to 2 mappers to be used at
> > > the same time. Curiously, this increase in the number of mappers was
> > > for the same number of entries as input.*
> > >
> > > *The number of mappers used by N2H and N2C is the main explanation
> > > for the differences between them. To give an example, the generation
> > > step in the first iteration took 11.6 minutes with N2C whereas N2H
> > > required 20 minutes. The latter had its input represented by 3
> > > Mappers whereas the former required only 2 mappers. The mapping part
> > > would have certainly taken a lot less time if it had been forced into
> > > 2 mappers with a larger input (or if our cluster allowed more than 2
> > > mappers / reducers).*
> >
> > Yeah, for sure this is interesting, but I think that has to do with
> > the number of Mappers you configured on your cluster (I think that by
> > default it comes configured to use 2 mappers).
> >
> > I am saying this because, if you see [1], by default we are creating a
> > single partition for Cassandra, so you might even be reading that data
> > twice, whereas in HBase [2] the accurate number of partitions is created.
> >
>
> So if I understand correctly:
> (a) there is no gain in using Cassandra in distributed mode, as it will
> represent everything as a single partition
>

Sadly but true.


> (b) in that case why are we getting more than one mapper working? Shouldn't
> there be just a single one?
>

The number of Mappers you get is determined by the number of Mappers your
cluster can run. I was running simple MapReduce jobs on my local computer
with default settings and always got two mappers; then I changed
mapred.max.split.size to bring in smaller chunks of data and got a single
one. We should find a way to set this parameter automatically to improve
Gora's data access.
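For context, in plain Hadoop FileInputFormat the split size, and hence the mapper count, is governed by splitSize = max(minSplitSize, min(maxSplitSize, blockSize)), so tuning mapred.max.split.size changes how many splits an input of a given size yields. A small sketch of that arithmetic (the formula is Hadoop's; the class and the MB units are just for illustration):

```java
public class SplitMath {
    // Hadoop's classic FileInputFormat rule:
    //   splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and hence map tasks) for a given input size.
    static long numSplits(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        // Sizes in MB for readability: 128 MB of input, 64 MB blocks.
        // With the default (effectively unbounded) max split size: 2 splits.
        System.out.println(numSplits(128, splitSize(1, Long.MAX_VALUE, 64))); // 2
        // Capping mapred.max.split.size at 32 MB doubles the mapper count.
        System.out.println(numSplits(128, splitSize(1, 32, 64))); // 4
    }
}
```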


> (c) if there is a single partition then we can't enforce data locality
>

True for data stores that are not using their own input formats inside the
getPartitions method.


> (d) shouldn't the HBase implementation return a constant number of
> partitions given an input of the same size?


Yes, it should. Isn't this the case?


> > Hope it makes sense.
> >
>
> Well, we are definitely making progress on the terms used at least.
>
> Thanks for taking the time to explain; I'm sure this thread will be
> useful to others.
>

Of course it will, mate.


Renato M.


> Julien
>
>
> >
> >
> > Renato M.
> >
> > [1]
> > https://github.com/apache/gora/blob/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L291
> > [2]
> > https://github.com/apache/gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L388
> >
> > > Planning to work on this coming summer (South American summer) ;)
> > > >
> > >
> > > Great!
> > >
> > > Thanks
> > >
> > > Julien
> > >
> > >
> > >
> > > >
> > > >
> > > > Renato M.
> > > >
> > > >
> > > > 2013/10/2 Julien Nioche <li...@gmail.com>
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I can't quite remember whether Gora takes data locality into
> > > > > account when generating the input for a map reduce job. Could
> > > > > someone explain how it is currently handled and if things differ
> > > > > from one backend to the other then how?
> > > > >
> > > > > Thanks
> > > > >
> > > > > Julien
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
