Posted to solr-user@lucene.apache.org by Brett Hoerner <br...@bretthoerner.com> on 2014/04/17 21:53:21 UTC

Re: index merge question

Sorry to bump this; I have the same issue and was curious about the sanity
of trying to work around it.

* I have a constant stream of realtime documents I need to continually
index. Sometimes they even overwrite very old documents (by using the same
unique ID).
* I also have a *huge* backlog of documents I'd like to get into a
SolrCloud cluster via Hadoop.

I understand that the MERGEINDEXES operation expects me to have unique
documents, but is it at all reasonable for me to try to change that? In a
plain Solr instance I can add doc1, then add doc1 again with new fields,
and the new update "wins"; I assume the old version is eventually removed
during segment merges. Does that mean it's possible for me to somehow
override a merge policy (or something like that?) to do effectively what
my Hadoop conflict-resolver does? I already have logic there that knows
how to (1) decide which of two duplicate documents to keep and (2) respect
and "keep" deletes over anything else.
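
To make the overwrite behavior I'm relying on concrete, here is a minimal
SolrJ sketch (SolrJ/Solr 4.x; the URL, core name and field names are just
placeholders for my real setup):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class OverwriteExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                    new HttpSolrServer("http://localhost:8983/solr/collection1");

            // First version of doc1.
            SolrInputDocument v1 = new SolrInputDocument();
            v1.addField("id", "doc1");
            v1.addField("title", "old title");
            server.add(v1);

            // Second add with the same uniqueKey: this one "wins".
            SolrInputDocument v2 = new SolrInputDocument();
            v2.addField("id", "doc1");
            v2.addField("title", "new title");
            server.add(v2);

            server.commit();
            // A query for id:doc1 now returns only the second version; the
            // first survives only as a deleted doc that segment merging
            // eventually drops.
            server.shutdown();
        }
    }

That only works because the adds go through Solr's update chain; as
discussed further down the thread, the raw core admin merge never looks
at the uniqueKey, which is the gap I'm trying to close.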

I'd love some pointers to which Solr/Lucene classes to look at if I wanted
to try my hand at this. I'm down in Lucene's SegmentMerger right now, but
it seems too low-level to carry whatever Solr "knows" about enforcing a
single unique ID at merge (and search...? or update...?) time.
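
For anyone else digging, my (possibly incomplete) reading so far: the
"last add wins" behavior is enforced at update time rather than merge
time. An add with a uniqueKey becomes a Lucene updateDocument(Term, doc),
which deletes any existing doc with that id term before adding the new
one, whereas the core admin merge goes through IndexWriter.addIndexes(),
which just copies segments with no uniqueKey check at all. A rough
standalone sketch of the difference (plain Lucene 4.x; directory and
field names are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class UpdateVsMerge {
        private static IndexWriter newWriter(Directory dir) throws Exception {
            return new IndexWriter(dir, new IndexWriterConfig(
                    Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        }

        private static Document doc1() {
            Document doc = new Document();
            doc.add(new StringField("id", "doc1", Store.YES));
            return doc;
        }

        public static void main(String[] args) throws Exception {
            // Build a second index that also contains "doc1".
            Directory other = new RAMDirectory();
            IndexWriter otherWriter = newWriter(other);
            otherWriter.addDocument(doc1());
            otherWriter.close();

            Directory main = new RAMDirectory();
            IndexWriter writer = newWriter(main);

            // Update path: delete-by-term on the id, then add. Adding
            // "doc1" twice this way still leaves exactly one live copy.
            writer.updateDocument(new Term("id", "doc1"), doc1());
            writer.updateDocument(new Term("id", "doc1"), doc1());

            // Merge path: raw segment copy, no id check - now there are
            // two live docs with id "doc1" in the main index.
            writer.addIndexes(other);

            writer.commit();
            System.out.println("live docs: " + writer.numDocs()); // prints 2
            writer.close();
        }
    }

If that reading is right, the uniqueness guarantee lives in the update
handler rather than in any merge policy, which would explain why
SegmentMerger looks so oblivious to it.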

Thanks!



On Tue, Jun 11, 2013 at 11:10 AM, Mark Miller <ma...@gmail.com> wrote:

> Right - but that sounds a little different from what we were talking about.
>
> You had brought up the core admin merge command that lets you merge an index
> into a running Solr cluster.
>
> We are calling that the golive option in the map reduce indexing code. It
> has the limitations we have discussed.
>
> However, if you are only using map reduce to build indexes, there are
> facilities for dealing with duplicate IDs - as you see in the
> documentation. The merges involved in that are different though - these are
> merges that happen as the final index is being constructed by the map
> reduce job. The final step is the golive step, where the indexes will be
> deployed to the running Solr cluster - this is what uses the core admin
> merge command, and if you are doing updates or adds outside of map reduce,
> you will face the issues we have discussed.
>
>
> - Mark
>
> On Jun 11, 2013, at 11:57 AM, James Thomas <JT...@Camstar.com> wrote:
>
> > FWIW, the Solr included with Cloudera Search, by default, "ignores all
> > but the most recent document version" during merges.
> > The conflict resolution is configurable, however.  See the documentation
> > for details.
> >
> > http://www.cloudera.com/content/support/en/documentation/cloudera-search/cloudera-search-documentation-v1-latest.html
> > -- see the user guide pdf, "update-conflict-resolver" parameter
> >
> > James
> >
> > -----Original Message-----
> > From: anirudha81@gmail.com [mailto:anirudha81@gmail.com] On Behalf Of
> Anirudha Jadhav
> > Sent: Tuesday, June 11, 2013 10:47 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: index merge question
> >
> > From my experience, the Lucene mergeTool and the one invoked by coreAdmin
> > are pure Lucene implementations and do not understand the concept of a
> > uniqueKey (a Solr-land concept).
> >
> > http://wiki.apache.org/solr/MergingSolrIndexes has a cautionary note
> > at the end
> >
> > We do frequent index merges, for which we externally run map/reduce jobs
> > (Java code using Lucene APIs) to merge & validate the merged indices
> > against their sources.
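> >
> > For reference, the Lucene merge tool I mean is
> > org.apache.lucene.misc.IndexMergeTool (in the lucene-misc jar), invoked
> > roughly like this (paths and jar names are placeholders):
> >
> >   java -cp lucene-core.jar:lucene-misc.jar \
> >       org.apache.lucene.misc.IndexMergeTool /path/to/merged /path/to/index1 /path/to/index2
> >
> > It just calls IndexWriter.addIndexes() on the source directories, so
> > duplicate uniqueKeys pass straight through.
> >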
> > -Ani
> >
> > On Tue, Jun 11, 2013 at 10:38 AM, Mark Miller <ma...@gmail.com> wrote:
> >> Yeah, you have to carefully manage things if you are building indexes
> >> with map/reduce *and* updating documents in other ways.
> >>
> >> If your 'source' data for MR index building is the 'truth', you also
> >> have the option of not doing incremental index merging, and you could
> >> simply rebuild the whole thing every time - of course, depending on your
> >> cluster size, that could be quite expensive.
> >
> >>
> >> - Mark
> >>
> >> On Jun 10, 2013, at 8:36 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>
> >>> Thanks Mark.  My question is stemming from the new Cloudera Search
> >>> stuff. My concern is that if, while rebuilding the index, someone
> >>> updates a doc, that update could be lost from a Solr perspective. I
> >>> guess what would need to happen to ensure the correct information was
> >>> indexed would be to record the start time and reindex the information
> >>> that changed since then?
> >>> On Jun 8, 2013 2:37 PM, "Mark Miller" <ma...@gmail.com> wrote:
> >>>
> >>>>
> >>>> On Jun 8, 2013, at 12:52 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>>>
> >>>>> When merging through the core admin (
> >>>>> http://wiki.apache.org/solr/MergingSolrIndexes) what is the policy
> >>>>> for conflicts during the merge?  So for instance, if I am merging
> >>>>> core 1 and core 2 into core 0 (first example), what happens if
> >>>>> core 1 and core 2 both have a document with the same key, say
> >>>>> core 1 has a newer version than core 2?  Does the merge fail, or
> >>>>> does the newer document remain?
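> >>>>> (By "first example" I mean, roughly:
> >>>>> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
> >>>>> where hostname and core names are placeholders.)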
> >>>>
> >>>> You end up with both documents, both with that ID - not generally a
> >>>> situation you want to end up in. You need to ensure unique IDs in
> >>>> the input data or replace the index rather than merging into it.
> >>>>
> >>>>>
> >>>>> Also, when using the srcCore method, what happens if a document with
> >>>>> key 1 is written while an index also containing key 1 is being
> >>>>> merged?
> >>>>
> >>>> It depends on the order I think - if the doc is written after the
> >>>> merge and it's an update, it will update the doc that was just
> >>>> merged in. If the merge comes second, you have the doc twice and
> >>>> it's a problem.
> >>>>
> >>>> - Mark
> >>
> >
> >
> >
> > --
> > Anirudha P. Jadhav
>
>