Posted to dev@nutch.apache.org by lewis john mcgibbney <le...@gmail.com> on 2011/08/29 22:41:25 UTC

Re: Gora CassandraStore is not thread safe?

Hi Tom,

Apologies for cross posting, this would not usually be the case but I'm
hoping that if any results come from the thread then both communities can
benefit.

I'm in the process of getting Cassandra 0.8.4 working with Nutch 2.0 and
Gora 0.2 myself and seem to be having some nasty problems.

Some questions for you:

1) How are you running Nutch, local or deployed?
2) How are you running Cassandra, local or deployed in a cluster?

The obvious thought is that this is a bug and that there are
method(s)/object(s) which are not thread safe.

Have you gotten any further with this?

Lewis


On Wed, Aug 10, 2011 at 8:43 PM, Tom Davidson <td...@covario.com> wrote:

> Has anyone tested the CassandraStore in Gora 0.2 using multiple threads?
> The Nutch 2 fetcher architecture has many threads writing to one
> GoraRecordWriter and I am getting concurrent modification errors like below.
>
> Caused by: java.util.ConcurrentModificationException
>               at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>               at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>               at org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:192)
>               at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
>


-- 
*Lewis*

Re: Gora CassandraStore is not thread safe?

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi,

I think we're at a stage where you're right, Chris. Following Alexis' commit,
I feel that this has been bottomed out. In addition, we are now at
Cassandra version 0.8.1.
Are you happy with this, Alexis?

Thanks



-- 
*Lewis*

Re: Gora CassandraStore is not thread safe?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Great work, thanks Alexis! Maybe it's time to close out GORA-22 then 
and leave any future things that crop up as new issues. 

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Gora CassandraStore is not thread safe?

Posted by Alexis <al...@gmail.com>.
The latest revision, 1177960, should now fix the thread-safety issue:

http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java?r1=1177960&r2=1177959&pathrev=1177960

Please comment on https://issues.apache.org/jira/browse/GORA-22 if
there is anything else.

Alexis


Re: Gora CassandraStore is not thread safe?

Posted by Alexis <al...@gmail.com>.
Hi,

I submitted the patch for peer review by just attaching it to the
issue: https://issues.apache.org/jira/browse/GORA-22

See this article about concurrency and HashMap for background on the topic:
http://www.ibm.com/developerworks/java/library/j-jtp07233/index.html

I ended up calling toArray over the key set to get around the
ConcurrentModificationException thrown by default by
java.util.HashMap when iterating over the keys.
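
A minimal sketch of the snapshot idea, not the actual GORA-22 patch. The
class and method names (SnapshotBuffer, put, flush, size) are hypothetical.
It copies the key set into an array before iterating, and it also holds a
lock while copying, on the assumption that writers may be active at the same
time: java.util.HashMap makes no guarantees when another thread modifies it
without synchronization.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a store's write buffer; this is not Gora code.
class SnapshotBuffer<K, V> {
    private final Map<K, V> buffer = new HashMap<K, V>();

    // Writers add entries while holding the buffer's lock.
    public void put(K key, V value) {
        synchronized (buffer) {
            buffer.put(key, value);
        }
    }

    // Copies the keys to an array under the lock, then walks the array,
    // so the loop never uses the live key set's fail-fast iterator.
    public List<V> flush() {
        Object[] keys;
        synchronized (buffer) {
            keys = buffer.keySet().toArray();
        }
        List<V> flushed = new ArrayList<V>();
        for (Object key : keys) {
            V value;
            synchronized (buffer) {
                value = buffer.remove(key);
            }
            if (value != null) {
                flushed.add(value); // a real store would write this row out
            }
        }
        return flushed;
    }

    public int size() {
        synchronized (buffer) {
            return buffer.size();
        }
    }
}
```

Keys added by other threads after the snapshot is taken are simply left in
the buffer for the next flush.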

I did occasionally run into Cassandra crashes and Hector exceptions
(usually because of GC triggered by the Cassandra daemon?) on my poor
5-year-old laptop while running the Nutch parse command, which is very
CPU and IO intensive. It worked out once I set, in mapred-site.xml (see
attached config), a reasonable read batch (400 rows at a time) and a
write batch of a different size (for example 843 rows written per
batch), so that reads and writes don't happen simultaneously.


Alexis


Re: Gora CassandraStore is not thread safe?

Posted by Alexis <al...@gmail.com>.
Hi Tom,

Thanks for testing Nutch 2.0 & Cassandra and reporting the obvious
bug. I must say development and testing on Gora & Nutch are not very
active, but at least there is some.


1. As regards your ConcurrentModification issue, it looks like it
happens when flushing the store. From your exception stacktrace:
(Line 192 in org.apache.gora.cassandra.store.CassandraStore)
    for (K key: this.buffer.keySet()) {

while there are other threads adding new keys to the HashMap:

(Line 266)
    this.buffer.put(key, p);

"it is not generally permissible for one thread to modify a Collection
while another thread is iterating over it."

Let me try to reproduce the bug and fix it with this in mind:
How about introducing some mutex / lock mechanism with
java.util.concurrent.locks.Lock or, easier, using a thread-safe
implementation such as java.util.concurrent.ConcurrentHashMap?
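
To illustrate the failure and the ConcurrentHashMap alternative, here is a
small sketch. The class and method names (BufferIterationDemo,
iterateWhileAdding) are hypothetical, and the map only stands in for the
store's buffer. To stay deterministic it adds keys from the iterating
thread itself instead of from a second thread; the fail-fast iterator of
HashMap reacts the same way to any structural change made outside the
iterator.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; the map stands in for CassandraStore's buffer.
class BufferIterationDemo {

    // Iterates over the key set and adds new keys during the iteration.
    // Returns true if the loop completed, false if the map's iterator
    // threw ConcurrentModificationException.
    static boolean iterateWhileAdding(Map<String, String> buffer) {
        buffer.put("row1", "a");
        buffer.put("row2", "b");
        buffer.put("row3", "c");
        int added = 0;
        try {
            for (String key : buffer.keySet()) {
                // Bounded, so the loop ends even if new keys become visible.
                if (added < 3) {
                    buffer.put("late-" + added, key);
                    added++;
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }
}
```

Calling iterateWhileAdding(new HashMap<String, String>()) returns false,
while passing a ConcurrentHashMap returns true, because its iterators are
weakly consistent and never throw ConcurrentModificationException. The
trade-off is that such an iterator may or may not see keys added after it
was created.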


2. Regarding the OutOfMemory error, maybe try decreasing the flushing
frequency as described here?
http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency

I like to use the jvisualvm utility from the JDK that monitors the
memory usage and tells you how this evolves during the execution of
the class...
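
To make the trade-off behind the flushing frequency concrete, here is a
rough sketch of a writer that buffers records and flushes once a limit is
reached. The names (CountingWriter, flushEvery) are hypothetical and this
is not the GoraRecordWriter code. A smaller limit means more frequent
flushes and fewer records held in memory at once; a larger limit means the
opposite.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Gora code: buffers records, flushes every N writes.
class CountingWriter<T> {
    private final int flushEvery;
    private final List<T> pending = new ArrayList<T>();
    private int flushCount = 0;
    private int maxBuffered = 0;

    CountingWriter(int flushEvery) {
        if (flushEvery < 1) {
            throw new IllegalArgumentException("flushEvery must be >= 1");
        }
        this.flushEvery = flushEvery;
    }

    // synchronized so that several fetcher-like threads can share one writer
    synchronized void write(T record) {
        pending.add(record);
        maxBuffered = Math.max(maxBuffered, pending.size());
        if (pending.size() >= flushEvery) {
            flush();
        }
    }

    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // a real store would send the batch to the backend here
        pending.clear();
        flushCount++;
    }

    synchronized int flushCount() { return flushCount; }

    // largest number of records that were ever held in memory at once
    synchronized int maxBuffered() { return maxBuffered; }
}
```

Writing 1000 records with a limit of 400 triggers two automatic flushes and
never holds more than 400 records; a final explicit flush() writes out the
remaining 200.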

Alexis


RE: Gora CassandraStore is not thread safe?

Posted by Tom Davidson <td...@covario.com>.
Hi Lewis,

I was running Nutch deployed with a dedicated Cassandra cluster. Frankly, I have given up on using Nutch 2 at this time as it seems highly unstable and not really in active development. Your effort to address this is encouraging. Because Nutch uses multithreading in the fetchers, I was getting ConcurrentModification errors and OutOfMemory errors on a regular basis in the CassandraStore. As far as I recall, the caching/flushing implementation is just not thread safe. If the CassandraStore caching were completely removed it might work, but it would probably not be very efficient. If I were to fix this class, I would try to rewrite it to use Hector batched mutations instead.

Tom

