Posted to dev@uima.apache.org by Jörn Kottmann <ko...@gmail.com> on 2011/02/03 14:40:59 UTC

Solrcas questions

Hi,

I have two questions about the Solrcas project in the sandbox.

I reviewed the code. Is there any special reason that SolrCASConsumer
extends JCasAnnotator_ImplBase? It does not appear to use any
JCas features and could extend CasAnnotator_ImplBase instead.
But maybe I am mistaken.

In line 132 the SolrServer.add method is called inside the AE's process
method.
Does this method already transmit the document over the network to Solr?
Or does this happen in the following line, where SolrServer.commit is called?
I am asking because if we could use auto commit, the Solr server process
might be able to group multiple documents into one commit, and then
we would not need to call SolrServer.commit for every document. The commit
behavior could be configurable.
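
A minimal sketch of what a configurable commit could look like. The names
RecordingClient and commitPerDocument are hypothetical, and the client is a
stand-in that only records calls; it is not the SolrJ API. The flag would
presumably come from a descriptor parameter.

```java
import java.util.ArrayList;
import java.util.List;

final class ConfigurableCommit {

    // Hypothetical stand-in for the Solr client; it only records the calls it receives.
    static final class RecordingClient {
        final List<String> calls = new ArrayList<>();
        void add(String document) { calls.add("add:" + document); }
        void commit() { calls.add("commit"); }
    }

    // commitPerDocument is an assumed configuration flag, not an existing Solrcas parameter.
    static void process(RecordingClient client, String document, boolean commitPerDocument) {
        client.add(document);
        if (commitPerDocument) {
            client.commit();
        }
        // Otherwise the commit is left to the server's auto commit settings.
    }

    public static void main(String[] args) {
        RecordingClient client = new RecordingClient();
        process(client, "a", false);
        process(client, "b", false);
        System.out.println(client.calls); // [add:a, add:b]
    }
}
```

With the flag switched off, no commit is sent from the process method at
all, so grouping documents into commits is entirely up to the server.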

In case add does not transmit the document synchronously to the Solr server,
the process method can return, but the CAS might still cause an error in
a future call to the process method. I do not want that, because it
complicates the error handling.

Thanks,
Jörn

Re: Solrcas questions

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/4/11 5:48 PM, Tommaso Teofili wrote:
> Regarding the embedded Solr server I think it can be worth to have this
> option also for non testing scenarios as it avoid a lot of network overhead
> so one could take advantage of it if the two instances (Solr and UIMA) are
> on the same machine.
>

I do not agree. When the Solr server process is running on the same machine,
the overhead of communicating with that process via network sockets is
negligible.
If the user changes his mind, it is really easy to relocate the
Solr server to a new machine by just changing the URL in the descriptor
or the DNS entry.

There are even more advantages: the processing pipeline can be
updated or changed without taking down the Solr server, which might
serve some other systems.

And the Solr developers also suggest not embedding Solr unless it is
really necessary.

"The simplest, safest, way to use Solr is via Solr's standard HTTP 
interfaces. Embedding Solr
is less flexible, harder to support, not as well tested, and should be 
reserved for special circumstances."
(http://wiki.apache.org/solr/EmbeddedSolr)

In our case it also has the disadvantage that Solrcas is harder to use,
because it is not clear to the user that all the Solr and Lucene
dependencies are not required.

In the end I think it is fine for testing, and if someone really has
"special circumstances" he can invest a few minutes and override the
createServer method. Having all these dependencies makes it
inconvenient to use for the suggested and preferred use case.

Jörn

Re: Solrcas questions

Posted by Jörn Kottmann <ko...@gmail.com>.
One more thing.

In the initialize method you do the following:

String solrInstanceTypeParam = 
String.valueOf(context.getConfigParameterValue("solrInstanceType"));
assert solrInstanceTypeParam != null;

The assertion will always hold, since String.valueOf never returns a null
reference; even when you pass a null reference to it, it returns the
string "null".

getConfigParameterValue can actually return null, and we should fail the
initialization if not all parameters necessary to run it are specified
in the descriptor. It should fail with a meaningful error message.
Using assert is not a safe way to fail the initialization, because
asserts are usually disabled.
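
The String.valueOf behaviour described above, and one way to fail fast on a
missing parameter, can be sketched as follows. The class and method names
(RequiredParams, requireParam) and the plain Map used in place of the UIMA
context are assumptions made to keep the example self-contained; in the
annotator the check would presumably throw ResourceInitializationException
rather than IllegalStateException.

```java
import java.util.HashMap;
import java.util.Map;

final class RequiredParams {

    // Hypothetical helper: returns the parameter as a String or fails with a clear message.
    static String requireParam(Map<String, Object> params, String name) {
        Object value = params.get(name);
        if (value == null) {
            throw new IllegalStateException(
                "Missing mandatory configuration parameter: " + name);
        }
        return value.toString();
    }

    public static void main(String[] args) {
        // Empty map: the parameter is missing, as if it were absent from the descriptor.
        Map<String, Object> params = new HashMap<>();

        // String.valueOf(Object) turns a null reference into the string "null",
        // so a following null check can never fail.
        String viaValueOf = String.valueOf(params.get("solrInstanceType"));
        System.out.println(viaValueOf != null);        // true
        System.out.println("null".equals(viaValueOf)); // true

        // An explicit check fails regardless of whether assertions are enabled.
        try {
            requireParam(params, "solrInstanceType");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```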

And in the process method the try-catch statement just wraps the entire
implementation, but only the calls to SolrServer.add and SolrServer.commit
can fail (and Class.getConstructor, but that one will vanish when the
CAS API is used instead, which is better for Solrcas anyway).
I believe it is always a good idea to minimize the scope of try-catch
statements, as long as there is no reason not to. The wrapped and
re-thrown exception could explain that adding or committing the document
to the Solr server failed. That is then also easier to debug in my
production system, where this AE will run and where adding a document to
my Solr server will fail for real from time to time for various reasons.
I am not sure what the best error handling is when commit fails,
because in that case the document might be added to Solr again, which
might lead to duplicate documents in the server.
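
A rough sketch of the narrower try-catch scope suggested above. IndexClient
and IndexingException are hypothetical stand-ins (for the Solr client and for
AnalysisEngineProcessException respectively), chosen so that the example
compiles without UIMA or SolrJ on the classpath.

```java
import java.io.IOException;

final class NarrowTryCatch {

    // Hypothetical stand-in for the Solr client; only the two calls that can fail.
    interface IndexClient {
        void add(String document) throws IOException;
        void commit() throws IOException;
    }

    // Hypothetical exception type standing in for AnalysisEngineProcessException.
    static final class IndexingException extends Exception {
        IndexingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static void process(IndexClient client, String document) throws IndexingException {
        // Preparing the document stays outside any try block.
        String prepared = document.trim();

        try {
            client.add(prepared);
        } catch (IOException e) {
            throw new IndexingException("Adding the document to the Solr server failed", e);
        }

        try {
            client.commit();
        } catch (IOException e) {
            throw new IndexingException("Committing to the Solr server failed", e);
        }
    }

    public static void main(String[] args) {
        IndexClient failingCommit = new IndexClient() {
            @Override public void add(String document) { }
            @Override public void commit() throws IOException {
                throw new IOException("connection reset");
            }
        };
        try {
            process(failingCommit, " some text ");
        } catch (IndexingException e) {
            System.out.println(e.getMessage() + ": " + e.getCause().getMessage());
        }
    }
}
```

The message of the re-thrown exception then says which of the two steps
failed, which is the part that matters when debugging in production.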

Jörn

Re: Solrcas questions

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Jorn,
good points :)

2011/2/3 Jörn Kottmann <ko...@gmail.com>

> Hi,
>
> I have two questions about the Solrcas project in the sandbox.
>
> Reviewed the code, is there any special reason that SolrCASConsumer
> does extend JCasAnnotator_ImplBase ? It looks like it is not using any
> JCas features and could also extend CasAnnotator_ImplBase instead.
> But maybe I am mistaken.
>

Thanks for pointing this out. I'm not at my computer, so I cannot take a
deeper look now, but I will surely do so as soon as I can.


>
> In line 132 the SolrServer.add method is called inside the AEs process
> method.
> Does this method already transmit the document over the network into Solr ?
> Or does this happens in the line after where SolrServer.commit is called.
> I am asking because if we could use auto commit the Solr Server process
> might be able to group multiple documents into one commit and then
> we would not need to call SolrServer.commit for every document. The commit
> behavior could be configurable.
>
> In case add does not transmit the document synchronously to the Solr
> Server,
> the process method can return but the CAS might still cause an error in a
> future
> call to the process method which I do not want, because it makes the error
> handling complicated.


The autocommit option could be leveraged by adding a parameter to the
CASConsumer descriptor to support such a scenario; otherwise that property
could be derived by "downloading" the solrconfig.xml (via a Solr REST call)
and behaving according to it. Regarding the add method, I think it's
synchronous, as is the commit one, but I need to check the Solr code more
carefully.
Regarding the embedded Solr server, I think it can be worthwhile to have this
option for non-testing scenarios as well, since it avoids a lot of network
overhead, so one could take advantage of it if the two instances (Solr and
UIMA) are on the same machine.


2011/2/4 Jörn Kottmann <ko...@gmail.com>

> And some more.
>
> This is defined as dependency in the pom:
> <dependency>
> <groupId>org.apache.uima</groupId>
> <artifactId>uimaj-component-test-util</artifactId>
> <version>2.3.1</version>
> </dependency>
>
> Shouldn't that be a test dependency only ?
>

I forgot to add the <scope>test</scope> tag for uimaj-component-test-util;
I need to fix it.


> Otherwise it indicated that the jar plus dependencies should be
> on the classpath while deployed.
>
> I also wonder which of all these jars I really need to run SolrJ. It looks
> like that most are needed for the embedded solr server.
> Does it make sense to use the embedded solr server for anything
> else than testing ?
>
> Should we declare createServer as protected ? Than people can overwrite
> it to create what ever kind of solr server they want or maybe
> tune/customize
> the http parameters.
>

This sounds like a good way to allow customizations, so I am +1 for that
change.


>
> Thanks,
> Jörn
>

Re: Solrcas questions

Posted by Jörn Kottmann <ko...@gmail.com>.
And some more.

This is defined as dependency in the pom:
<dependency>
<groupId>org.apache.uima</groupId>
<artifactId>uimaj-component-test-util</artifactId>
<version>2.3.1</version>
</dependency>

Shouldn't that be a test dependency only?
Otherwise it indicates that the jar plus its dependencies should be
on the classpath when deployed.

I also wonder which of all these jars I really need to run SolrJ. It looks
like most are needed for the embedded Solr server.
Does it make sense to use the embedded Solr server for anything
other than testing?

Should we declare createServer as protected? Then people can override
it to create whatever kind of Solr server they want, or maybe
tune/customize the HTTP parameters.
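
The idea of a protected createServer can be sketched like this. Everything
here is a simplified, hypothetical stand-in: the Server interface, the
Consumer classes, the single-argument createServer signature and the URL are
assumptions for illustration, not the actual Solrcas or SolrJ types.

```java
final class CreateServerHook {

    // Hypothetical stand-in for a Solr client handle.
    interface Server {
        String describe();
    }

    // Hypothetical simplified consumer; only the factory hook is shown.
    static class Consumer {
        // protected, so that subclasses can supply their own server.
        protected Server createServer(String url) {
            return () -> "http client for " + url;
        }

        final String connect(String url) {
            return createServer(url).describe();
        }
    }

    // A subclass replaces the server without touching the rest of the consumer.
    static class TunedConsumer extends Consumer {
        @Override
        protected Server createServer(String url) {
            return () -> "http client for " + url + " with custom timeouts";
        }
    }

    public static void main(String[] args) {
        // The URL is an assumed example value.
        System.out.println(new Consumer().connect("http://localhost:8983/solr"));
        System.out.println(new TunedConsumer().connect("http://localhost:8983/solr"));
    }
}
```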

Thanks,
Jörn

Re: Solrcas questions

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/9/11 3:47 PM, Tommaso Teofili wrote:
> regarding asserts in the initialize() method they can be safely removed as they were put there mainly for debugging purpose, however the initialization of the Consumer would fail if such params are null or badly defined as you can see inside the createServer(type,path) and inside the FieldMappingReader.getConf(path) methods

Let's open a Jira issue for this one.
> the cas element in the mapping file is an optional one and I thought it was useful to track the cas which delivered information, in the sample file it gets mapped inside an id field but it doesn't mean it MUST be unique; however that is optional and maybe the toString() method isn't the best one to store the cas information, but I still think it makes sense to not loose such an information.

I believe that in the vast majority of cases it is really not unique.
People can have an FS in the CAS which contains a unique id, and that
can easily be mapped to an id field in Solr. The current implementation
can do that already. I also believe the toString value is not at all
helpful for debugging anything. You might want to log debug information
into the CAS. If you wish to keep that in Solr, it would be possible to
simply map these FSes.

> I agree with the need to switch to the CAS API
Then let's open a Jira issue for it.
> I agree also regarding the enhancing the exception handling for debugging errors; if commit fails I think that should be handled the same way as an add() fails otherwise it should be created a commit policy (i.e. a cache of documents previously added to try to re-send them) parameter but I think it's out of the scope of a basic Solrcas implementation and more related to how Solr handles commit errors
> I'd introduce the already discussed autocommit configuration parameter (boolean) to indicate if Solrcas should also send a commit to the SolrServer (it may also make sense to create a third value for this param called 'destroy' that would trigger the commit only on the destroy() method even if in that case any errors during the commit could not be recovered)

When there is no unique id, the document will be added to Solr again
if the commit failed the first time. I am not sure what the best way to
handle these errors is. In some cases you might just want to ignore them,
in others you might want to retry. I also wonder whether autocommit is
not the best option when a massive number of documents is streamed to
Solr from multiple UIMA pipelines. Do you have some experience here?
> regarding the EmbeddedSolrServer I agree that it's generally not a top option in production but I am working now with a Solr project where network latency has a significance impact (being Solr the best solution anyways) and I'd get a considerable advantage if I can query it avoiding HTTP requests that way, however since the main way to query Solr is via REST calls I have no objections removing it
>
Sounds good, let's use it for testing only. We also need to enhance the
test. We should add a document and then retrieve it to see that it is
in Solr as expected.

Do you want to open the Jira issues yourself?

Jörn


Re: Solrcas questions

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Jorn,
I will be back from a short holiday next week; in the meantime, here's a quick list of thoughts on the points you raised.
Regarding the asserts in the initialize() method: they can safely be removed, as they were put there mainly for debugging purposes. However, the initialization of the Consumer would still fail if such params are null or badly defined, as you can see inside the createServer(type,path) and FieldMappingReader.getConf(path) methods.
The cas element in the mapping file is optional, and I thought it was useful for tracking the CAS which delivered the information. In the sample file it gets mapped to an id field, but that doesn't mean it MUST be unique. Maybe the toString() method isn't the best way to store the CAS information, but I still think it makes sense not to lose such information.
I agree with the need to switch to the CAS API.
I also agree regarding enhancing the exception handling for debugging errors. If commit fails, I think that should be handled the same way as a failing add(); otherwise a commit policy parameter should be created (e.g. a cache of previously added documents, to try to re-send them), but I think that's out of the scope of a basic Solrcas implementation and more related to how Solr handles commit errors.
I'd introduce the already discussed autocommit configuration parameter (boolean) to indicate whether Solrcas should also send a commit to the SolrServer. It may also make sense to create a third value for this param, called 'destroy', that would trigger the commit only in the destroy() method, even if in that case any errors during the commit could not be recovered.
Regarding the EmbeddedSolrServer, I agree that it's generally not a top option in production, but I am now working on a Solr project where network latency has a significant impact (Solr being the best solution anyway), and I'd get a considerable advantage if I could query it without HTTP requests. However, since the main way to query Solr is via REST calls, I have no objections to removing it.
Thanks for the fix on UIMA-2041.
Cheers,
Tommaso


On 08 Feb 2011, at 15:52, Jörn Kottmann wrote:

> I am trying to understand why there is a cas element in the mapping file.
> 
> The documentation explains the it specifies the field in solr
> which is used to map the value of JCas.toString(), but why should
> anyone wants to do that? The documentation sample maps it to
> the id field in solr.
> 
> Can JCas.toString be used as an id?
> If I looked at the code correctly JCasImpl does not overwrite toString,
> then simply Object.toString is called which produces a string based on
> the object address. In one JVM two objects could have the same address
> at two distinct points in time. Which would lead to identical ids for different
> documents.
> Anyway  isn't the JCas instance reused? Then this will just be the same
> string depending on the instance over and over again.
> 
> Jörn
> 


Re: Solrcas questions

Posted by Jörn Kottmann <ko...@gmail.com>.
I am trying to understand why there is a cas element in the mapping file.

The documentation explains that it specifies the field in Solr
to which the value of JCas.toString() is mapped, but why would
anyone want to do that? The documentation sample maps it to
the id field in Solr.

Can JCas.toString be used as an id?
If I read the code correctly, JCasImpl does not override toString,
so Object.toString is simply called, which produces a string based on
the object address. Within one JVM, two objects could have the same
address at two distinct points in time, which would lead to identical
ids for different documents.
Anyway, isn't the JCas instance reused? Then this will just be the same
string for that instance over and over again.
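
A small sketch of the default toString behaviour in question. Holder is a
hypothetical class standing in for any class that does not override
toString. Strictly speaking, the Java documentation defines the result as
the class name, an '@', and the hexadecimal hash code; that hash code is
often, but not necessarily, derived from the object address.

```java
final class DefaultToString {

    // Hypothetical stand-in for a class that does not override toString().
    static final class Holder { }

    // Rebuilds what Object.toString() is documented to return.
    static String describe(Object o) {
        return o.getClass().getName() + "@" + Integer.toHexString(System.identityHashCode(o));
    }

    public static void main(String[] args) {
        Holder reused = new Holder();
        // Same instance, same string on every call: no use as a per-document id.
        System.out.println(reused.toString().equals(reused.toString())); // true
        System.out.println(reused.toString().equals(describe(reused)));  // true
        System.out.println(reused);
    }
}
```

Nothing in that string depends on the content held by the object, so a
reused instance yields the same value for every document it processes.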

Jörn