Posted to dev@gora.apache.org by Sergey Weiss <sw...@griddynamics.com> on 2014/10/08 15:18:30 UTC

GORA-227

Hello!

This is to bring attention to the ticket GORA-227
<https://issues.apache.org/jira/browse/GORA-227>. My team and I have
developed a plugin for Nutch (in a fork of the 2.x branch) and wanted to
write a test similar to TestGenerator
<http://svn.apache.org/viewvc/nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestGenerator.java?view=markup>.
It turned out that TestGenerator is currently disabled, and our
investigation led us to tickets NUTCH-1572
<https://issues.apache.org/jira/browse/NUTCH-1572> and GORA-225
<https://issues.apache.org/jira/browse/GORA-225>, which has GORA-227 as a
subtask. We did some debugging and posted a message on the ticket. I'm
copying it here:

> Hello!
>
> I have debugged TestGenerator and, from what I saw, it fails because the
> query is executed on a different MemStore instance than the one that
> holds the injected web pages. That is, when GeneratorJob initializes its
> mapper and reducer, it creates a new instance of MemStore for each. Each
> of these two instances holds its own internal map and knows nothing about
> the other or about the MemStore created by TestGenerator (and populated
> with web pages).
>
> What is the best way to address this issue? Should we somehow amend
> DataStoreFactory to make it return a single instance of MemStore, or
> should all MemStores share their state? Any suggestions?
>
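
A minimal sketch of the failure mode described above. IsolatedStore is a
hypothetical stand-in that only mimics "one private map per instance"; it is
not Gora's MemStore, and the URL and page content are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an in-memory store; NOT Gora's actual MemStore.
class IsolatedStore {
    // Each instance owns its own map, so data never crosses instances.
    private final Map<String, String> pages = new HashMap<>();

    void put(String url, String content) { pages.put(url, content); }
    String get(String url) { return pages.get(url); }
    int size() { return pages.size(); }
}

class IsolatedStoreDemo {
    public static void main(String[] args) {
        // Store created and populated by the test.
        IsolatedStore testStore = new IsolatedStore();
        testStore.put("http://example.com/", "injected page");

        // Store created later, e.g. by a mapper; it starts out empty.
        IsolatedStore mapperStore = new IsolatedStore();

        System.out.println(testStore.size());   // prints 1
        System.out.println(mapperStore.size()); // prints 0
    }
}
```

A query run against mapperStore finds nothing, which matches the symptom
reported for TestGenerator.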

A day passed with no reply, so I figured it might be a good idea to
post it on the mailing list.
Any reply is welcome. Thank you in advance!

Best regards,
Sergey Weiss

Re: GORA-227

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Sergey,

I am sorry we missed you on JIRA, but here we are :)
So the main issue for me with MemStore is that it is really just a
regular HashMap wrapped by Gora. MemStore lives inside the JVM where you
have instantiated it. So, as you pointed out, when you create two
different instances of MemStore, these are actually two different objects.
Why does it work when you use CassandraStore, for example? Because there,
Gora connects to the cluster, which holds all the data and makes it
available to everyone.
There was a project that tried to use Hazelcast as an in-memory data store
to overcome this issue. But I don't think that is the answer, because it
just adds another datastore, which becomes an extra dependency.
Over in Apache Giraph, their main concern was having too many external
dependencies, and that is why MemStore was actually a good solution for
running the tests. Now, going back to the test you want to write: yes, the
way to go would be to create a shared MemStore in the test. This is not
really related to Gora but rather to the way the test was designed over in
Nutch. I mean, the test is for checking that Nutch is doing what it is
supposed to do, not really for checking whether the data is in there or
not; that is just a result of the test.
I will check the Nutch code and then I can provide some more input on this.
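
One way to picture the "shared store in the test" idea is a small registry
that hands out the same instance for the same name. This is only a sketch
under assumptions: RegisteredStore and SharedStoreRegistry are hypothetical
names and do not reflect how Gora's DataStoreFactory actually works.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory store; NOT Gora's MemStore.
class RegisteredStore {
    private final Map<String, String> pages = new ConcurrentHashMap<>();

    void put(String url, String content) { pages.put(url, content); }
    String get(String url) { return pages.get(url); }
}

// Hypothetical registry: one store instance per name within a JVM.
class SharedStoreRegistry {
    private static final Map<String, RegisteredStore> STORES =
            new ConcurrentHashMap<>();

    // Returns the existing store for this name, creating it on first use.
    static RegisteredStore getOrCreate(String name) {
        return STORES.computeIfAbsent(name, n -> new RegisteredStore());
    }
}

class SharedStoreRegistryDemo {
    public static void main(String[] args) {
        RegisteredStore fromTest = SharedStoreRegistry.getOrCreate("webpage");
        fromTest.put("http://example.com/", "injected page");

        // A later lookup (e.g. from a mapper in the same JVM) gets the same object.
        RegisteredStore fromMapper = SharedStoreRegistry.getOrCreate("webpage");
        System.out.println(fromTest == fromMapper); // prints true
        System.out.println(fromMapper.get("http://example.com/")); // prints injected page
    }
}
```

Like MemStore itself, this only helps when the test and the job run in the
same JVM.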


Renato M.

2014-10-08 23:01 GMT+02:00 Henry Saputra <he...@gmail.com>:

> Hi Sergey,
>
> There is an attempt to make MemStore like other stores:
> https://issues.apache.org/jira/browse/GORA-228
>
> But just to fix the tests, I think we could use a static
> ConcurrentHashMap or Collections.synchronizedMap(new LinkedHashMap) as
> the backing store for MemStore for now.
>
> CC @Lewis and @Renato
>
> - Henry

Re: GORA-227

Posted by Henry Saputra <he...@gmail.com>.
Hi Sergey,

There is an attempt to make MemStore like other stores:
https://issues.apache.org/jira/browse/GORA-228

But just to fix the tests, I think we could use a static
ConcurrentHashMap or Collections.synchronizedMap(new LinkedHashMap) as
the backing store for MemStore for now.
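
A rough sketch of what a static backing map could look like, assuming a
hypothetical StaticBackedStore class (not the real MemStore code). Because
the map is static, every instance in the same JVM reads and writes the same
data:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical store with a static backing map; NOT Gora's MemStore.
class StaticBackedStore {
    // One map per JVM (per class loader), shared by all instances.
    private static final ConcurrentMap<String, String> PAGES =
            new ConcurrentHashMap<>();

    void put(String url, String content) { PAGES.put(url, content); }
    String get(String url) { return PAGES.get(url); }

    // Shared state outlives instances, so tests should reset it explicitly.
    static void clear() { PAGES.clear(); }
}

class StaticBackedStoreDemo {
    public static void main(String[] args) {
        StaticBackedStore writer = new StaticBackedStore();
        writer.put("http://example.com/", "injected page");

        // A separate instance sees the data written through the first one.
        StaticBackedStore reader = new StaticBackedStore();
        System.out.println(reader.get("http://example.com/")); // prints injected page

        StaticBackedStore.clear();
        System.out.println(reader.get("http://example.com/")); // prints null
    }
}
```

Two caveats with this approach: ConcurrentHashMap rejects null keys and
values, and the static state leaks between tests unless it is cleared.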

CC @Lewis and @Renato

- Henry
