Posted to solr-user@lucene.apache.org by Stuart Smith <st...@yahoo.com> on 2011/05/12 22:46:16 UTC

A couple newbie questions

Hello!
  I just started using Solr. My general use case is pushing a lot of data from HBase to Solr via an M/R job using Solrj. I have lots of questions, but the ones I'd like to start with are:

(1)
I noticed this:
http://lucene.472066.n3.nabble.com/what-happens-to-docsPending-if-stop-solr-before-commit-td2781493.html

That would seem to indicate that pending documents are committed on restart. This is great! I also noticed that while there is a lag on startup if I have documents pending, it's only a few minutes or so. But if I issue a commit for the same number of files, the server stays blocked for 20 min or so. It almost seems like it would be faster to add all my documents and restart the server, rather than issuing a commit. Am I doing something strange? Is this a valid conclusion?

(2)
I'm also getting a lot of errors about invalid UTF-8:

SEVERE: org.apache.solr.common.SolrException: Invalid UTF-8 character 0xffff at char #2380289, byte #2378666)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)

It could be that the values I have in some of my document fields are indeed invalid. My question is: what does this mean when I'm submitting a batch of documents (specifically, I'm using Solrj's StreamingUpdateSolrServer w/ a BinaryRequestWriter)? Do I:

- lose the whole batch that has the bad document?
- lose the document?
- lose the one field?

I wish it was the third, hope it's the second, and I'm afraid it's the first...
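
Whatever the answer turns out to be, one possible client-side workaround is to filter field values before they are added to a document. The stack trace goes through XMLLoader, so the request appears to reach the server as XML, and 0xffff is outside the character range that XML 1.0 allows. Below is a minimal sketch under that assumption; the class and method names (XmlSanitizer, stripInvalidXmlChars) are made up for illustration and are not part of Solr or Solrj.

```java
// Hypothetical helper, not part of Solr or Solrj: drops code points that
// fall outside the XML 1.0 "Char" production before a value is indexed.
final class XmlSanitizer {
    private XmlSanitizer() {}

    /** True if the code point is allowed in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of value with every disallowed code point removed. */
    static String stripInvalidXmlChars(String value) {
        if (value == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            // codePointAt returns a lone surrogate as-is, and the check
            // above rejects it because it lies in 0xD800-0xDFFF.
            int cp = value.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "ok\uFFFFtext\u0000!";
        System.out.println(stripInvalidXmlChars(dirty)); // prints "oktext!"
    }
}
```

In the M/R job this would presumably be called on each string value just before it goes into the document. It silently drops the offending characters, so it only makes sense if losing them is acceptable.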

Ooo.. and I guess a third question - I'm having trouble finding a document that describes the overall design/functionality of Solr, something that would help me reason about stuff like "what happens to pending documents when the server restarts" or "does a commit in one indexing thread commit previously added documents from another indexing thread". Both of those I've answered to my satisfaction by looking over the Solr logs & mailing lists, but I'm wondering if there's some documentation I missed somehow..
For example, something like this:
http://hadoop.apache.org/common/docs/current/hdfs_design.html
http://hbase.apache.org/book.html#architecture

Thanks!

Take care,
  -stu