Posted to user@gora.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/03/01 02:01:56 UTC

mapreduce.GoraRecordWriter configuration settings

Hi,
We use the above class for write operations in the Nutch InjectorJob.
I am writing large URL lists to Cassandra using Gora and wonder if I can
get it working better.
Currently I am getting around 10,000 writes per 90 seconds. Don't get me
wrong: I am working from a very primitive laptop, and right now I am merely
attempting to push the software.
What I want to know is: what is the consequence of altering
BUFFER_LIMIT_WRITE_VALUE?
We currently set a default of 10K for this limit, meaning that Gora
buffers writes and flushes them in batches of that size.
Is a higher or lower value better? Is there any evidence of better
performance from changing this value?
I see it as pretty critical, so I want to understand more about it.
Thanks
Lewis

-- 
*Lewis*
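The batching Lewis describes — buffer each write, flush whenever the buffer reaches the limit — can be sketched roughly like this. This is a hypothetical illustration, not Gora's actual source; the class and field names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of a GoraRecordWriter-style buffered writer.
class BufferedWriterSketch {
    private final int bufferLimit;              // e.g. the 10K default discussed above
    private final List<String> buffer = new ArrayList<>();
    int flushCount = 0;                         // exposed only for illustration

    BufferedWriterSketch(int bufferLimit) {
        this.bufferLimit = bufferLimit;
    }

    void write(String key) {
        buffer.add(key);
        if (buffer.size() >= bufferLimit) {
            flush();                            // datastore.flush() in the real writer
        }
    }

    void flush() {
        // In Gora this would push the buffered mutations to the backend (Cassandra).
        buffer.clear();
        flushCount++;
    }

    void close() {
        if (!buffer.isEmpty()) flush();         // final flush at the end of the job
    }
}
```

With a limit of 1000, writing 2500 records triggers two flushes during the job and leaves 500 records for the final flush at close().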

Re: mapreduce.GoraRecordWriter configuration settings

Posted by Roland von Herget <ro...@gmail.com>.
Hi Lewis,

What counts as a write here? One column, or one key with several columns?
I think we can get a major speed improvement for the case where we
write two or more columns for one key. I will create a JIRA issue for
that later and try to tackle it next week.
We also have to look at how to expose all (or most) of the Hector
configuration options (like cluster auto-discovery, ...).

From my Nutch log (during inject):
2013-03-09 17:28:32,948 INFO  mapreduce.GoraRecordWriter - Flushing
the datastore after 1000 records
2013-03-09 17:28:35,470 INFO  mapreduce.GoraRecordWriter - Flushing
the datastore after 2000 records
2013-03-09 17:28:37,940 INFO  mapreduce.GoraRecordWriter - Flushing
the datastore after 3000 records
2013-03-09 17:28:40,326 INFO  mapreduce.GoraRecordWriter - Flushing
the datastore after 4000 records

So my new 3-node cluster needs about 2.5 s per 1000 records. I think I
have to work on that :)

--Roland
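For what it's worth, the interval between two flush lines can be computed directly from the log timestamps above. A small sketch (the timestamps are taken verbatim from the log; everything else is illustrative):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

class FlushInterval {
    // Milliseconds between two log timestamps of the form HH:mm:ss,SSS
    // (the format used by the log4j lines above).
    static long millisBetween(String earlier, String later) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");
        return Duration.between(LocalTime.parse(earlier, f),
                                LocalTime.parse(later, f)).toMillis();
    }
}
```

17:28:35,470 minus 17:28:32,948 gives 2522 ms for 1000 records, i.e. roughly 400 writes per second — consistent with Roland's "about 2.5 s / 1000 records".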

On Sat, Mar 9, 2013 at 12:50 AM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> Hi Guys,
>
> Just a quick update on some findings here
>
> With 1M URLs and gora.buffer.write.limit settings of 1000 and 100
> respectively (on a reasonably powerful machine) I get the following results
>
> 1000 limit
> -time elapsed: 9m42s or 582s
> -writes p/s 1718
>
> 100 limit
> -time elapsed: 9m33s or 573s
> -writes p/s 1745
>
> So reducing the write factor (in Cassandra) to the low limit of 100 knocks
> 1.5ish% off execute time and increases write throughout to Cassandra by
> around 25 p/s... which is really what we expect from Cassandra anyway.
>
> I am as happy with these results to I'll stick to low maximum limits for
> buffered writes (with Cassandra) from now on.
>
> Have a great weekend.
> Lewis
>
>
>
> On Tue, Mar 5, 2013 at 10:09 AM, Lewis John Mcgibbney
> <le...@gmail.com> wrote:
>>
>> Thanks for the input Roland. I share a similar use case.
>> @Renato, the gora.write.buffer.limit property can be overridden within the
>> Hadoop Configuration. AFAIK we can override in nutch-site.xml if using Nutch
>> or core-site.xml if using Gora over hadoop.
>> This is the way I have been tinkering.
>> I was curious as to obtaining performance gains.
>>
>>
>> On Tuesday, March 5, 2013, Renato Marroquín Mogrovejo
>> <re...@gmail.com> wrote:
>> > This is a very interesting topic to discuss about thank you for starting
>> > it Lewis (:
>> > I think we have to think about two different application types, the ones
>> > doing real time processing, and the ones doing batch processing. For the
>> > former, a smaller flush-threshold is probably a better choice, and for the
>> > latter one a value depending on the application should be used i.e.
>> > different applications might consider "batch operations differently".
>> > Just one quick question here Lewis, is this possible to set this
>> > parameter through the configuration file? or is it always hard-coded? I
>> > think it should be settable from outside Gora without having to recompile
>> > Gora every time we want to change it. What do you guys think?
>> >
>> >
>> > Renato M.
>> >
>> > On Mar 5, 2013 7:23 AM, "Roland" <ro...@rvh-gmbh.de> wrote:Hi Lewis,
>> >>
>> >> for me (nutch use case) a lower value is better, because of 3 main
>> >> reasons:
>> >> a) load is better distributed for the db backend
>> >> b) when running the nutch fetcherJob, towards the end of the job you
>> >> don't have to wait for gora flushing all data to backend, because it was
>> >> mostly done during the fetching
>> >> c) during debugging you'll get gora/cassandra flushing errors much
>> >> earlier
>> >>
>> >> I'm running with 1k write buffer for cassandra.
>> >>
>> >> --Roland
>> >>
>> >> Am 01.03.2013 02:01, schrieb Lewis John Mcgibbney:
>> >>
>> >> Hi,
>> >> We use the above class for write operations in the Nutch InjectorJob.
>> >> I am writing large URL lists to Cassandra using Gora and wonder if I
>> >> can get it working better.
>> >> Currently I am getting around 10000 writes per 90 seconds. Don't get me
>> >> wrong, I am working from a very primitive laptop and right now I am merely
>> >> attempting to push the software.
>> >> What I want to know, is what is the consequence of altering the
>> >> BUFFER_LIMIT_WRITE_VALUE?
>> >> Currently we set a default value of 10K for the limit on this value,
>> >> meaning that Gora batches flushes to reflect this value.
>> >> Is a higher or lower value better? Is there any evidence of better
>> >> performance by changing this value.
>> >> I see it a pretty critical so I am wanting to understand more about
>> >> this.
>> >> Thanks
>> >> Lewis
>> >>
>> >> --
>> >> Lewis
>> >>
>> >>
>> >
>>
>> --
>> Lewis
>>
>
>
>
> --
> Lewis

Re: mapreduce.GoraRecordWriter configuration settings

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Guys,

Just a quick update on some findings here.

With 1M URLs and gora.buffer.write.limit settings of 1000 and 100
respectively (on a reasonably powerful machine), I get the following results:

1000 limit
-time elapsed: 9m42s or 582s
-writes p/s 1718

100 limit
-time elapsed: 9m33s or 573s
-writes p/s 1745

So reducing the write buffer limit (in Cassandra) to the low value of 100 knocks
about 1.5% off execution time and increases write throughput to Cassandra by
around 25 p/s... which is really what we expect from Cassandra anyway.

I am happy with these results, so I'll stick to low maximum limits for
buffered writes (with Cassandra) from now on.

Have a great weekend.
Lewis
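As a sanity check on the figures above, the writes-per-second numbers follow directly from the record count and elapsed time (simple integer arithmetic, nothing Gora-specific):

```java
class Throughput {
    // Writes per second, rounded down, from total records and elapsed seconds.
    static long writesPerSecond(long records, long seconds) {
        return records / seconds;
    }
}
```

1,000,000 / 582 s gives 1718 writes/s and 1,000,000 / 573 s gives 1745 writes/s, matching the reported numbers.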


On Tue, Mar 5, 2013 at 10:09 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Thanks for the input Roland. I share a similar use case.
> @Renato, the gora.write.buffer.limit property can be overridden within the
> Hadoop Configuration. AFAIK we can override in nutch-site.xml if using
> Nutch or core-site.xml if using Gora over hadoop.
> This is the way I have been tinkering.
> I was curious as to obtaining performance gains.
>
>
> On Tuesday, March 5, 2013, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
> > This is a very interesting topic to discuss about thank you for starting
> it Lewis (:
> > I think we have to think about two different application types, the ones
> doing real time processing, and the ones doing batch processing. For the
> former, a smaller flush-threshold is probably a better choice, and for
> the latter one a value depending on the application should be used i.e.
> different applications might consider "batch operations differently".
> > Just one quick question here Lewis, is this possible to set this
> parameter through the configuration file? or is it always hard-coded? I
> think it should be settable from outside Gora without having to recompile
> Gora every time we want to change it. What do you guys think?
> >
> >
> > Renato M.
> >
> > On Mar 5, 2013 7:23 AM, "Roland" <ro...@rvh-gmbh.de> wrote:Hi Lewis,
> >>
> >> for me (nutch use case) a lower value is better, because of 3 main
> reasons:
> >> a) load is better distributed for the db backend
> >> b) when running the nutch fetcherJob, towards the end of the job you
> don't have to wait for gora flushing all data to backend, because it was
> mostly done during the fetching
> >> c) during debugging you'll get gora/cassandra flushing errors much
> earlier
> >>
> >> I'm running with 1k write buffer for cassandra.
> >>
> >> --Roland
> >>
> >> Am 01.03.2013 02:01, schrieb Lewis John Mcgibbney:
> >>
> >> Hi,
> >> We use the above class for write operations in the Nutch InjectorJob.
> >> I am writing large URL lists to Cassandra using Gora and wonder if I
> can get it working better.
> >> Currently I am getting around 10000 writes per 90 seconds. Don't get me
> wrong, I am working from a very primitive laptop and right now I am merely
> attempting to push the software.
> >> What I want to know, is what is the consequence of altering the
> BUFFER_LIMIT_WRITE_VALUE?
> >> Currently we set a default value of 10K for the limit on this value,
> meaning that Gora batches flushes to reflect this value.
> >> Is a higher or lower value better? Is there any evidence of better
> performance by changing this value.
> >> I see it a pretty critical so I am wanting to understand more about
> this.
> >> Thanks
> >> Lewis
> >>
> >> --
> >> Lewis
> >>
> >>
> >
>
> --
> *Lewis*
>
>


-- 
*Lewis*

Re: mapreduce.GoraRecordWriter configuration settings

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Thanks for the input Roland. I share a similar use case.
@Renato, the gora.write.buffer.limit property can be overridden within the
Hadoop Configuration. AFAIK we can override it in nutch-site.xml if using
Nutch, or in core-site.xml if using Gora over Hadoop.
This is how I have been tinkering with it.
I was curious about obtaining performance gains.
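For reference, an override in nutch-site.xml would look something like the fragment below. This is only a sketch: the thread spells the property both as gora.buffer.write.limit and gora.write.buffer.limit, so verify the exact key (and its default) against your Gora version before relying on it.

```xml
<!-- nutch-site.xml (or core-site.xml when running Gora directly over Hadoop) -->
<property>
  <!-- Property name as written earlier in this thread; verify for your Gora version. -->
  <name>gora.buffer.write.limit</name>
  <value>1000</value>
  <description>Flush the Gora write buffer every 1000 records.</description>
</property>
```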


On Tuesday, March 5, 2013, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:
> This is a very interesting topic to discuss about thank you for starting
it Lewis (:
> I think we have to think about two different application types, the ones
doing real time processing, and the ones doing batch processing. For the
former, a smaller flush-threshold is probably a better choice, and for
the latter one a value depending on the application should be used i.e.
different applications might consider "batch operations differently".
> Just one quick question here Lewis, is this possible to set this
parameter through the configuration file? or is it always hard-coded? I
think it should be settable from outside Gora without having to recompile
Gora every time we want to change it. What do you guys think?
>
>
> Renato M.
>
> On Mar 5, 2013 7:23 AM, "Roland" <ro...@rvh-gmbh.de> wrote:Hi Lewis,
>>
>> for me (nutch use case) a lower value is better, because of 3 main
reasons:
>> a) load is better distributed for the db backend
>> b) when running the nutch fetcherJob, towards the end of the job you
don't have to wait for gora flushing all data to backend, because it was
mostly done during the fetching
>> c) during debugging you'll get gora/cassandra flushing errors much
earlier
>>
>> I'm running with 1k write buffer for cassandra.
>>
>> --Roland
>>
>> Am 01.03.2013 02:01, schrieb Lewis John Mcgibbney:
>>
>> Hi,
>> We use the above class for write operations in the Nutch InjectorJob.
>> I am writing large URL lists to Cassandra using Gora and wonder if I can
get it working better.
>> Currently I am getting around 10000 writes per 90 seconds. Don't get me
wrong, I am working from a very primitive laptop and right now I am merely
attempting to push the software.
>> What I want to know, is what is the consequence of altering the
BUFFER_LIMIT_WRITE_VALUE?
>> Currently we set a default value of 10K for the limit on this value,
meaning that Gora batches flushes to reflect this value.
>> Is a higher or lower value better? Is there any evidence of better
performance by changing this value.
>> I see it a pretty critical so I am wanting to understand more about this.
>> Thanks
>> Lewis
>>
>> --
>> Lewis
>>
>>
>

-- 
*Lewis*

Re: mapreduce.GoraRecordWriter configuration settings

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
This is a very interesting topic to discuss; thank you for starting it,
Lewis (:
I think we have to consider two different application types: the ones
doing real-time processing, and the ones doing batch processing. For the
former, a smaller flush threshold is probably a better choice; for the
latter, a value depending on the application should be used, i.e.
different applications might define "batch operations" differently.
Just one quick question here, Lewis: is it possible to set this parameter
through the configuration file, or is it always hard-coded? I think it
should be settable from outside Gora, without having to recompile Gora
every time we want to change it. What do you guys think?


Renato M.

On Mar 5, 2013 7:23 AM, "Roland" <ro...@rvh-gmbh.de> wrote:
> Hi Lewis,
>
> for me (nutch use case) a lower value is better, because of 3 main reasons:
> a) load is better distributed for the db backend
> b) when running the nutch fetcherJob, towards the end of the job you don't
> have to wait for gora flushing all data to backend, because it was mostly
> done during the fetching
> c) during debugging you'll get gora/cassandra flushing errors much earlier
>
> I'm running with 1k write buffer for cassandra.
>
> --Roland
>
> Am 01.03.2013 02:01, schrieb Lewis John Mcgibbney:
>
> Hi,
> We use the above class for write operations in the Nutch InjectorJob.
> I am writing large URL lists to Cassandra using Gora and wonder if I can
> get it working better.
> Currently I am getting around 10000 writes per 90 seconds. Don't get me
> wrong, I am working from a very primitive laptop and right now I am merely
> attempting to push the software.
> What I want to know, is what is the consequence of altering the
> BUFFER_LIMIT_WRITE_VALUE?
> Currently we set a default value of 10K for the limit on this value,
> meaning that Gora batches flushes to reflect this value.
> Is a higher or lower value better? Is there any evidence of better
> performance by changing this value.
> I see it a pretty critical so I am wanting to understand more about this.
> Thanks
> Lewis
>
> --
> *Lewis*
>
>
>

Re: mapreduce.GoraRecordWriter configuration settings

Posted by Roland <ro...@rvh-gmbh.de>.
Hi Lewis,

For me (Nutch use case) a lower value is better, for three main reasons:
a) load is better distributed for the db backend
b) when running the Nutch FetcherJob, you don't have to wait at the end of
the job for Gora to flush all the data to the backend, because most of it
was already flushed during fetching
c) during debugging you'll get Gora/Cassandra flushing errors much earlier

I'm running with a 1k write buffer for Cassandra.

--Roland
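Roland's reason (b) is easy to quantify: if the writer flushes every time the buffer reaches its limit N, then up to N-1 records can still be sitting unflushed when the last record is written, and they go out only in the final flush at close(). A small illustrative sketch:

```java
class FinalFlush {
    // Records still buffered at the end of a job (i.e. flushed only in close()),
    // assuming the writer flushes each time the buffer reaches bufferLimit.
    static long pendingAtClose(long totalRecords, long bufferLimit) {
        return totalRecords % bufferLimit;
    }
}
```

With 999,950 records, a 10K buffer leaves 9,950 records for the final flush at close(), while a 100-record buffer leaves only 50 — hence less waiting at the end of the job with a smaller limit.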

On 01.03.2013 02:01, Lewis John Mcgibbney wrote:
> Hi,
> We use the above class for write operations in the Nutch InjectorJob.
> I am writing large URL lists to Cassandra using Gora and wonder if I 
> can get it working better.
> Currently I am getting around 10000 writes per 90 seconds. Don't get 
> me wrong, I am working from a very primitive laptop and right now I am 
> merely attempting to push the software.
> What I want to know, is what is the consequence of altering the 
> BUFFER_LIMIT_WRITE_VALUE?
> Currently we set a default value of 10K for the limit on this value, 
> meaning that Gora batches flushes to reflect this value.
> Is a higher or lower value better? Is there any evidence of better 
> performance by changing this value.
> I see it a pretty critical so I am wanting to understand more about this.
> Thanks
> Lewis
>
> -- 
> /Lewis/