Posted to solr-user@lucene.apache.org by Ravi Solr <ra...@gmail.com> on 2015/09/25 23:17:32 UTC

bulk reindexing 5.3.0 issue

I have been trying to re-index the docs (about 1.5 million) because one of the
fields needed part of its string value removed (it was accidentally introduced).
I was issuing a query for 100 docs at a time, getting 4 fields, and updating each
doc (atomic update with "set") via the CloudSolrClient in batches. However, from
time to time the query returns 0 results, which exits the re-indexing program.
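
For illustration, here is a bare-bones sketch of the kind of batch atomic-update
loop I am describing (the ZooKeeper host, collection name, and field names
"uniqueId"/"uuid" are placeholders, not my real code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class BatchAtomicUpdate {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181")) {     // placeholder ZK host
            client.setDefaultCollection("mycollection");                     // placeholder collection
            SolrQuery q = new SolrQuery("uuid:sun.org.mozilla*");
            q.setRows(100).setFields("uniqueId", "uuid");                    // 100 docs, 2 of the fields
            List<SolrInputDocument> batch = new ArrayList<>();
            for (SolrDocument doc : client.query(q).getResults()) {
                SolrInputDocument in = new SolrInputDocument();
                in.addField("uniqueId", doc.getFieldValue("uniqueId"));      // the uniqueKey
                String fixed = ((String) doc.getFieldValue("uuid"))
                        .replace("sun.org.mozilla.javascript.internal.NativeString:", "");
                in.addField("uuid", Collections.singletonMap("set", fixed)); // atomic "set" update
                batch.add(in);
            }
            if (!batch.isEmpty()) {
                client.add(batch);
                client.commit();
            }
        }
    }
}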

I can't understand why the cloud returns 0 results when there are roughly 1.4
million docs which still have the "accidental" string in them.

Is there another way to do massive bulk updates?

Thanks

Ravi Kiran Bhaskar

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Gili, I was constantly checking the cloud admin UI and it always stayed
green, which is why I initially overlooked sync issues... finally, when all
other options dried up, I went to each node individually and queried it, and that
is when I found the out-of-sync issue. The way I resolved my issue was to shut
down the leader that was not syncing properly and let another node become
the leader, then reindex all docs. Once the reindexing was done I started
the node that was causing the issue and it synced properly :-)
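
In case it helps anyone else, this is roughly the kind of per-node check that
exposed the mismatch (a sketch only; the core URLs are placeholders). Querying
each replica core directly with distrib=false shows whether the counts agree:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PerReplicaCount {
    public static void main(String[] args) throws Exception {
        String[] cores = {                                       // placeholder core URLs, one per replica
            "http://solr1:8983/solr/mycollection_shard1_replica1",
            "http://solr2:8983/solr/mycollection_shard1_replica2"
        };
        SolrQuery q = new SolrQuery("uuid:sun.org.mozilla*");
        q.setRows(0);                                            // only the count is needed
        q.set("distrib", "false");                               // answer from this core only, no fan-out
        for (String url : cores) {
            try (HttpSolrClient core = new HttpSolrClient(url)) {
                System.out.println(url + " -> " + core.query(q).getResults().getNumFound());
            }
        }
    }
}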

Thanks

Ravi Kiran Bhaskar



On Mon, Sep 28, 2015 at 10:26 AM, Gili Nachum <gi...@gmail.com> wrote:

> Were all of shard replica in active state (green color in admin ui) before
> starting?
> Sounds like it otherwise you won't hit the replica that is out of sync.
>
> Replicas can get out of sync, and report being in sync after a sequence of
> stop start w/o a chance to complete sync.
> See if it might have happened to you:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201412.mbox/%3CCAOOKt53XTU_e0m2ioJ-S4SfsAp8JC6m-=nyBbd4g_MjH60bfpg@mail.gmail.com%3E
> On Sep 27, 2015 06:56, "Ravi Solr" <ra...@gmail.com> wrote:
>
> > Erick...There is only one type of String
> > "sun.org.mozilla.javascript.internal.NativeString:" and no other
> variations
> > of that in my index, so no question of missing it. Point taken regarding
> > the CURSORMARK stuff, yes you are correct, my head so numb at this point
> > after working 3 days on this, I wasnt thinking straight.
> >
> > BTW I found the real issue, I have a total of 8 servers in the solr
> cloud.
> > The leader for this specific collection was the one that was returning 0
> > for the searches. All other 7 servers had roughly 800K docs still needing
> > the string replacement. So maybe the real issue is sync among servers.
> Just
> > to prove to myself I shutdown the solr  that was giving zero results
> (i.e.
> > all uuid strings have already been somehow devoid of spurious
> > sun.org.mozilla.javascript.internal.NativeString on that server). Now it
> > ran perfectly fine and is about to finish as last 103K are still left
> when
> > I was writing this email.
> >
> > So the real question is how can we ensure that the Sync is always
> > maintained and what to do if it ever goes out of Sync, I did see some
> Jira
> > tickets from previous 4.10.x versions where Sync was an issue. Can you
> > please point me to any doc which says how SolrCloud synchs/replicates ?
> >
> > Thanks,
> >
> > Ravi Kiran Bhaskar
> >
> > Thanks
> >
> > Rvai Kiran Bhaskar
> >
> > On Sat, Sep 26, 2015 at 11:00 PM, Erick Erickson <
> erickerickson@gmail.com>
> > wrote:
> >
> > > bq: 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially
> > > using
> > > 100 docs batch, which, I later increased to 500 docs per batch. Also it
> > > would not be a infinite loop if I commit for each batch, right !!??
> > >
> > > That's not the point at all. Look at the basic logic here:
> > >
> > > You run for a while processing 100 (or 500 or 1,000) docs per batch
> > > and change all uuid fields with this statement:
> > >
> > > uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
> > >
> > > and then update the doc. You run this as long as you have any docs
> > > that satisfy the query "q=uuid:sun.org.mozilla*", _changing_
> > > every one that has this string!
> > >
> > > At that point, theoretically, no document in your index has this
> string.
> > So
> > > running your update program immediately after should find _zero_
> > documents.
> > >
> > > I've been assuming your complaint is that you don't process 1.4 M docs
> > (in
> > > batches), you process some lower number then exit and you think this is
> > > wrong.
> > > I'm claiming that you should only expect to find as many docs as have
> > been
> > > indexed since the last time the program ran.
> > >
> > > As far as the infinite loop is concerned, again trace the logic in the
> > old
> > > code.
> > > Forget about commits and all the mechanics, just look at the logic.
> > > You're querying on "sun.org.mozilla*". But you only change if you get a
> > > match on
> > > "sun.org.mozilla.javascript.internal.NativeString:"
> > >
> > > Now imagine you have a doc that has sun.org.mozilla.erick in it. That
> doc
> > > gets
> > > returned from the query but does _not_ get modified because it doesn't
> > > match your pattern. In the older code, it would be found again and
> > > returned next
> > > time you queried. Then not modified again. Eventually you'd be in a
> > > position
> > > where you never changed any docs, just kept getting the same docList
> back
> > > over and over again. Marching through based on the unique key should
> not
> > > have the same potential issue.
> > >
> > > You should not be mixing the new query stuff with CURSORMARK. Deep
> paging
> > > supposes the exact same query is being run over and over and you're
> > > _paging_
> > > through the results. You're changing the query every time so the
> results
> > > aren't
> > > very predictable.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Sat, Sep 26, 2015 at 5:01 PM, Ravi Solr <ra...@gmail.com> wrote:
> > > > Erick & Shawn I incrporated your suggestions.
> > > >
> > > >
> > > > 0. Shut off all other indexing processes.
> > > > 1. As Shawn mentioned set batch size to 10000.
> > > > 2. Loved Erick's suggestion about not using filter at all and sort by
> > > > uniqueId and put last known uinqueId as next queries start while
> still
> > > > using cursor marks as follows
> > > >
> > > > SolrQuery q = new SolrQuery("+uuid:sun.org.mozilla* +uniqueId:{" +
> > > > markerSysId + " TO
> > > > *]").setRows(10000).addSort("uniqueId",ORDER.asc).setFields(new
> > > > String[]{"uniqueId","uuid"});
> > > > q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
> > > >
> > > > 3. As per Shawn's advise commented autocommit and soft commit in
> > > > solrconfig.xml and set openSearcher to false and issued MANUAL COMMIT
> > for
> > > > every batch from code as follows
> > > >
> > > > client.commit(true, true, true);
> > > >
> > > > Here is what the log statement & results - log.info("Indexed " +
> > count +
> > > > "/" + docList.getNumFound());
> > > >
> > > >
> > > > 2015-09-26 17:29:57 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 90000/1344085
> > > > 2015-09-26 17:30:30 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 100000/1334085
> > > > 2015-09-26 17:33:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 110000/1324085
> > > > 2015-09-26 17:36:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 120000/1314085
> > > > 2015-09-26 17:39:42 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 130000/1304085
> > > > 2015-09-26 17:43:05 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 140000/1294085
> > > > 2015-09-26 17:46:14 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 150000/1284085
> > > > 2015-09-26 17:48:22 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 160000/1274085
> > > > 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/0
> > > > 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
> > > >
> > > > Ran manually a second time to see if first was fluke. Still same.
> > > >
> > > > 2015-09-26 17:55:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 10000/1264716
> > > > 2015-09-26 17:58:07 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 20000/1254716
> > > > 2015-09-26 18:03:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 30000/1244716
> > > > 2015-09-26 18:06:32 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 40000/1234716
> > > > 2015-09-26 18:10:35 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 50000/1224716
> > > > 2015-09-26 18:15:23 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > > 60000/1214716
> > > > 2015-09-26 18:15:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/0
> > > > 2015-09-26 18:15:26 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
> > > >
> > > > Now changed the autommit in solrconfig.xml as follows...Note the soft
> > > > commit has been shut off as per Shawn's advise
> > > >
> > > >     <autoCommit>
> > > >        <!-- <maxDocs>100</maxDocs> -->
> > > >        <maxTime>300000</maxTime>
> > > >      <openSearcher>false</openSearcher>
> > > >     </autoCommit>
> > > >
> > > >   <!--
> > > >     <autoSoftCommit>
> > > >         <maxTime>30000</maxTime>
> > > >     </autoSoftCommit>
> > > >   -->
> > > >
> > > > 2015-09-26 18:47:44 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > Indexed 10000/1205451
> > > > 2015-09-26 18:50:49 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > Indexed 20000/1195451
> > > > 2015-09-26 18:54:18 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > Indexed 30000/1185451
> > > > 2015-09-26 18:57:04 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > Indexed 40000/1175451
> > > > 2015-09-26 19:00:10 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > Indexed 50000/1165451
> > > > 2015-09-26 19:00:13 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > Indexed 50000/0
> > > > 2015-09-26 19:00:13 INFO
> > [com.wpost.search.reindexing.AdhocCorrectUUID]
> > > -
> > > > FINISHED !!!
> > > >
> > > >
> > > > The query still returned 0 results when they are over million docs
> > > > available which match uuid:sun.org.mozilla* ...Then why do I get 0
> ???
> > > >
> > > > Thanks
> > > >
> > > > Ravi Kiran Bhaskar
> > > >
> > > > On Sat, Sep 26, 2015 at 3:49 PM, Ravi Solr <ra...@gmail.com>
> wrote:
> > > >
> > > >> Thank you Erick & Shawn for taking significant time off your
> weekends
> > to
> > > >> debug and explain in great detail. I will try to address the main
> > points
> > > >> from your emails to provide more situation context for better
> > > understanding
> > > >> of my situation
> > > >>
> > > >> 1. Erick, As part of our upgrade from 4.7.2 to 5.3.0 I re-indexed
> all
> > > docs
> > > >> from my old Master-Slave to My SolrCloud using DIH
> SolrEntityProcessor
> > > >> which used a Script Transformer. I unwittingly messed up the script
> > and
> > > >> hence this 'uuid' (String Type field) got messed up. All records
> prior
> > > to
> > > >> Sep 20 2015 have this issue that I am currently try to rectify.
> > > >>
> > > >> 2. Regarding openSearcher=true/false, I had it as false all along in
> > my
> > > >> 4.7.2 config. I read somewhere that SolrCloud or 5.x doesn't honor
> it
> > > or it
> > > >> should be left default (Don't exactly remember where I read it),
> > hence,
> > > I
> > > >> removed it from my solrconfig.xml going against my intuition :-)
> > > >>
> > > >> 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially
> > > using
> > > >> 100 docs batch, which, I later increased to 500 docs per batch. Also
> > it
> > > >> would not be a infinite loop if I commit for each batch, right !!??
> > > >>
> > > >> 4. Shawn, you are correct the uuid is of String Type and its not
> > unique
> > > >> key for my schema. My uniqueKey is uniqueId and systemid is of no
> > > >> consequence here, it's another field for differentiating apps within
> > my
> > > >> solr.
> > > >>
> > > >> Than you very much again guys. I will incorporate your suggestions
> and
> > > >> report back.
> > > >>
> > > >> Thanks
> > > >>
> > > >> Ravi Kiran Bhaskar
> > > >>
> > > >> On Sat, Sep 26, 2015 at 12:58 PM, Erick Erickson <
> > > erickerickson@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Oh, one more thing. _assuming_ you can't change the indexing
> process
> > > >>> that gets the docs from the system of record, why not just add an
> > > >>> update processor that does this at index time? See:
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
> > > >>> ,
> > > >>> in particular the StatelessScriptUpdateProcessorFactory might be a
> > > >>> good candidate. It just takes a bit of javascript (or other
> scripting
> > > >>> language) and changes the record before it gets indexed.
> > > >>>
> > > >>> FWIW,
> > > >>> Erick
> > > >>>
> > > >>> On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <apache@elyograg.org
> >
> > > >>> wrote:
> > > >>> > On 9/26/2015 10:41 AM, Shawn Heisey wrote:
> > > >>> >> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
> > > >>> >
> > > >>> > This needs to include openSearcher=false, as Erick mentioned.
> I'm
> > > sorry
> > > >>> > I screwed that up:
> > > >>> >
> > > >>> >   <autoCommit>
> > > >>> >     <maxTime>300000</maxTime>
> > > >>> >     <openSearcher>false</openSearcher>
> > > >>> >   </autoCommit>
> > > >>> >
> > > >>> > Thanks,
> > > >>> > Shawn
> > > >>>
> > > >>
> > > >>
> > >
> >
>

Re: bulk reindexing 5.3.0 issue

Posted by Gili Nachum <gi...@gmail.com>.
Were all of the shard replicas in the active state (green in the admin UI) before
starting?
It sounds like they were; otherwise you wouldn't have hit the replica that is out of sync.

Replicas can get out of sync, yet report being in sync, after a sequence of
stop/start without a chance to complete syncing.
See if that might have happened to you:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201412.mbox/%3CCAOOKt53XTU_e0m2ioJ-S4SfsAp8JC6m-=nyBbd4g_MjH60bfpg@mail.gmail.com%3E
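
As a quick check, replica states can also be read programmatically from the
cluster state (a sketch; the ZooKeeper host and collection name are placeholders).
Keep in mind that an ACTIVE state alone does not prove a replica is actually in
sync; comparing per-core doc counts with distrib=false queries is the more
telling test:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ReplicaStates {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181")) {       // placeholder ZK host
            client.connect();                                                  // populate cluster state
            ClusterState state = client.getZkStateReader().getClusterState();
            for (Slice slice : state.getCollection("mycollection").getSlices()) {  // placeholder collection
                for (Replica r : slice.getReplicas()) {
                    boolean live = state.getLiveNodes().contains(r.getNodeName());
                    System.out.println(slice.getName() + " " + r.getName()
                            + " state=" + r.getState() + " live=" + live);
                }
            }
        }
    }
}
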
On Sep 27, 2015 06:56, "Ravi Solr" <ra...@gmail.com> wrote:

> Erick...There is only one type of String
> "sun.org.mozilla.javascript.internal.NativeString:" and no other variations
> of that in my index, so no question of missing it. Point taken regarding
> the CURSORMARK stuff, yes you are correct, my head so numb at this point
> after working 3 days on this, I wasnt thinking straight.
>
> BTW I found the real issue, I have a total of 8 servers in the solr cloud.
> The leader for this specific collection was the one that was returning 0
> for the searches. All other 7 servers had roughly 800K docs still needing
> the string replacement. So maybe the real issue is sync among servers. Just
> to prove to myself I shutdown the solr  that was giving zero results (i.e.
> all uuid strings have already been somehow devoid of spurious
> sun.org.mozilla.javascript.internal.NativeString on that server). Now it
> ran perfectly fine and is about to finish as last 103K are still left when
> I was writing this email.
>
> So the real question is how can we ensure that the Sync is always
> maintained and what to do if it ever goes out of Sync, I did see some Jira
> tickets from previous 4.10.x versions where Sync was an issue. Can you
> please point me to any doc which says how SolrCloud synchs/replicates ?
>
> Thanks,
>
> Ravi Kiran Bhaskar
>
> Thanks
>
> Rvai Kiran Bhaskar
>
> On Sat, Sep 26, 2015 at 11:00 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > bq: 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially
> > using
> > 100 docs batch, which, I later increased to 500 docs per batch. Also it
> > would not be a infinite loop if I commit for each batch, right !!??
> >
> > That's not the point at all. Look at the basic logic here:
> >
> > You run for a while processing 100 (or 500 or 1,000) docs per batch
> > and change all uuid fields with this statement:
> >
> > uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
> >
> > and then update the doc. You run this as long as you have any docs
> > that satisfy the query "q=uuid:sun.org.mozilla*", _changing_
> > every one that has this string!
> >
> > At that point, theoretically, no document in your index has this string.
> So
> > running your update program immediately after should find _zero_
> documents.
> >
> > I've been assuming your complaint is that you don't process 1.4 M docs
> (in
> > batches), you process some lower number then exit and you think this is
> > wrong.
> > I'm claiming that you should only expect to find as many docs as have
> been
> > indexed since the last time the program ran.
> >
> > As far as the infinite loop is concerned, again trace the logic in the
> old
> > code.
> > Forget about commits and all the mechanics, just look at the logic.
> > You're querying on "sun.org.mozilla*". But you only change if you get a
> > match on
> > "sun.org.mozilla.javascript.internal.NativeString:"
> >
> > Now imagine you have a doc that has sun.org.mozilla.erick in it. That doc
> > gets
> > returned from the query but does _not_ get modified because it doesn't
> > match your pattern. In the older code, it would be found again and
> > returned next
> > time you queried. Then not modified again. Eventually you'd be in a
> > position
> > where you never changed any docs, just kept getting the same docList back
> > over and over again. Marching through based on the unique key should not
> > have the same potential issue.
> >
> > You should not be mixing the new query stuff with CURSORMARK. Deep paging
> > supposes the exact same query is being run over and over and you're
> > _paging_
> > through the results. You're changing the query every time so the results
> > aren't
> > very predictable.
> >
> > Best,
> > Erick
> >
> >
> > On Sat, Sep 26, 2015 at 5:01 PM, Ravi Solr <ra...@gmail.com> wrote:
> > > Erick & Shawn I incrporated your suggestions.
> > >
> > >
> > > 0. Shut off all other indexing processes.
> > > 1. As Shawn mentioned set batch size to 10000.
> > > 2. Loved Erick's suggestion about not using filter at all and sort by
> > > uniqueId and put last known uinqueId as next queries start while still
> > > using cursor marks as follows
> > >
> > > SolrQuery q = new SolrQuery("+uuid:sun.org.mozilla* +uniqueId:{" +
> > > markerSysId + " TO
> > > *]").setRows(10000).addSort("uniqueId",ORDER.asc).setFields(new
> > > String[]{"uniqueId","uuid"});
> > > q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
> > >
> > > 3. As per Shawn's advise commented autocommit and soft commit in
> > > solrconfig.xml and set openSearcher to false and issued MANUAL COMMIT
> for
> > > every batch from code as follows
> > >
> > > client.commit(true, true, true);
> > >
> > > Here is what the log statement & results - log.info("Indexed " +
> count +
> > > "/" + docList.getNumFound());
> > >
> > >
> > > 2015-09-26 17:29:57 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 90000/1344085
> > > 2015-09-26 17:30:30 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 100000/1334085
> > > 2015-09-26 17:33:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 110000/1324085
> > > 2015-09-26 17:36:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 120000/1314085
> > > 2015-09-26 17:39:42 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 130000/1304085
> > > 2015-09-26 17:43:05 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 140000/1294085
> > > 2015-09-26 17:46:14 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 150000/1284085
> > > 2015-09-26 17:48:22 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 160000/1274085
> > > 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/0
> > > 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
> > >
> > > Ran manually a second time to see if first was fluke. Still same.
> > >
> > > 2015-09-26 17:55:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 10000/1264716
> > > 2015-09-26 17:58:07 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 20000/1254716
> > > 2015-09-26 18:03:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 30000/1244716
> > > 2015-09-26 18:06:32 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 40000/1234716
> > > 2015-09-26 18:10:35 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 50000/1224716
> > > 2015-09-26 18:15:23 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> > 60000/1214716
> > > 2015-09-26 18:15:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/0
> > > 2015-09-26 18:15:26 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
> > >
> > > Now changed the autommit in solrconfig.xml as follows...Note the soft
> > > commit has been shut off as per Shawn's advise
> > >
> > >     <autoCommit>
> > >        <!-- <maxDocs>100</maxDocs> -->
> > >        <maxTime>300000</maxTime>
> > >      <openSearcher>false</openSearcher>
> > >     </autoCommit>
> > >
> > >   <!--
> > >     <autoSoftCommit>
> > >         <maxTime>30000</maxTime>
> > >     </autoSoftCommit>
> > >   -->
> > >
> > > 2015-09-26 18:47:44 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > Indexed 10000/1205451
> > > 2015-09-26 18:50:49 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > Indexed 20000/1195451
> > > 2015-09-26 18:54:18 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > Indexed 30000/1185451
> > > 2015-09-26 18:57:04 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > Indexed 40000/1175451
> > > 2015-09-26 19:00:10 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > Indexed 50000/1165451
> > > 2015-09-26 19:00:13 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > Indexed 50000/0
> > > 2015-09-26 19:00:13 INFO
> [com.wpost.search.reindexing.AdhocCorrectUUID]
> > -
> > > FINISHED !!!
> > >
> > >
> > > The query still returned 0 results when they are over million docs
> > > available which match uuid:sun.org.mozilla* ...Then why do I get 0 ???
> > >
> > > Thanks
> > >
> > > Ravi Kiran Bhaskar
> > >
> > > On Sat, Sep 26, 2015 at 3:49 PM, Ravi Solr <ra...@gmail.com> wrote:
> > >
> > >> Thank you Erick & Shawn for taking significant time off your weekends
> to
> > >> debug and explain in great detail. I will try to address the main
> points
> > >> from your emails to provide more situation context for better
> > understanding
> > >> of my situation
> > >>
> > >> 1. Erick, As part of our upgrade from 4.7.2 to 5.3.0 I re-indexed all
> > docs
> > >> from my old Master-Slave to My SolrCloud using DIH SolrEntityProcessor
> > >> which used a Script Transformer. I unwittingly messed up the script
> and
> > >> hence this 'uuid' (String Type field) got messed up. All records prior
> > to
> > >> Sep 20 2015 have this issue that I am currently try to rectify.
> > >>
> > >> 2. Regarding openSearcher=true/false, I had it as false all along in
> my
> > >> 4.7.2 config. I read somewhere that SolrCloud or 5.x doesn't honor it
> > or it
> > >> should be left default (Don't exactly remember where I read it),
> hence,
> > I
> > >> removed it from my solrconfig.xml going against my intuition :-)
> > >>
> > >> 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially
> > using
> > >> 100 docs batch, which, I later increased to 500 docs per batch. Also
> it
> > >> would not be a infinite loop if I commit for each batch, right !!??
> > >>
> > >> 4. Shawn, you are correct the uuid is of String Type and its not
> unique
> > >> key for my schema. My uniqueKey is uniqueId and systemid is of no
> > >> consequence here, it's another field for differentiating apps within
> my
> > >> solr.
> > >>
> > >> Than you very much again guys. I will incorporate your suggestions and
> > >> report back.
> > >>
> > >> Thanks
> > >>
> > >> Ravi Kiran Bhaskar
> > >>
> > >> On Sat, Sep 26, 2015 at 12:58 PM, Erick Erickson <
> > erickerickson@gmail.com>
> > >> wrote:
> > >>
> > >>> Oh, one more thing. _assuming_ you can't change the indexing process
> > >>> that gets the docs from the system of record, why not just add an
> > >>> update processor that does this at index time? See:
> > >>>
> >
> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
> > >>> ,
> > >>> in particular the StatelessScriptUpdateProcessorFactory might be a
> > >>> good candidate. It just takes a bit of javascript (or other scripting
> > >>> language) and changes the record before it gets indexed.
> > >>>
> > >>> FWIW,
> > >>> Erick
> > >>>
> > >>> On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <ap...@elyograg.org>
> > >>> wrote:
> > >>> > On 9/26/2015 10:41 AM, Shawn Heisey wrote:
> > >>> >> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
> > >>> >
> > >>> > This needs to include openSearcher=false, as Erick mentioned.  I'm
> > sorry
> > >>> > I screwed that up:
> > >>> >
> > >>> >   <autoCommit>
> > >>> >     <maxTime>300000</maxTime>
> > >>> >     <openSearcher>false</openSearcher>
> > >>> >   </autoCommit>
> > >>> >
> > >>> > Thanks,
> > >>> > Shawn
> > >>>
> > >>
> > >>
> >
>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Erick... there is only one form of the string,
"sun.org.mozilla.javascript.internal.NativeString:", and no other variations
of it in my index, so there is no question of missing any. Point taken regarding
the CURSORMARK stuff; yes, you are correct, my head is so numb at this point
after working 3 days on this that I wasn't thinking straight.

BTW, I found the real issue. I have a total of 8 servers in the Solr cloud.
The leader for this specific collection was the one that was returning 0
for the searches. All of the other 7 servers had roughly 800K docs still needing
the string replacement. So maybe the real issue is sync among servers. Just
to prove it to myself I shut down the Solr node that was giving zero results (i.e.
all uuid strings on that server had already somehow been stripped of the spurious
sun.org.mozilla.javascript.internal.NativeString). Now it
ran perfectly fine and is about to finish; the last 103K were still left as
I was writing this email.

So the real question is how we can ensure that the sync is always
maintained, and what to do if it ever goes out of sync. I did see some Jira
tickets from previous 4.10.x versions where sync was an issue. Can you
please point me to any doc that explains how SolrCloud syncs/replicates?

Thanks,

Ravi Kiran Bhaskar


On Sat, Sep 26, 2015 at 11:00 PM, Erick Erickson <er...@gmail.com>
wrote:

> bq: 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially
> using
> 100 docs batch, which, I later increased to 500 docs per batch. Also it
> would not be a infinite loop if I commit for each batch, right !!??
>
> That's not the point at all. Look at the basic logic here:
>
> You run for a while processing 100 (or 500 or 1,000) docs per batch
> and change all uuid fields with this statement:
>
> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>
> and then update the doc. You run this as long as you have any docs
> that satisfy the query "q=uuid:sun.org.mozilla*", _changing_
> every one that has this string!
>
> At that point, theoretically, no document in your index has this string. So
> running your update program immediately after should find _zero_ documents.
>
> I've been assuming your complaint is that you don't process 1.4 M docs (in
> batches), you process some lower number then exit and you think this is
> wrong.
> I'm claiming that you should only expect to find as many docs as have been
> indexed since the last time the program ran.
>
> As far as the infinite loop is concerned, again trace the logic in the old
> code.
> Forget about commits and all the mechanics, just look at the logic.
> You're querying on "sun.org.mozilla*". But you only change if you get a
> match on
> "sun.org.mozilla.javascript.internal.NativeString:"
>
> Now imagine you have a doc that has sun.org.mozilla.erick in it. That doc
> gets
> returned from the query but does _not_ get modified because it doesn't
> match your pattern. In the older code, it would be found again and
> returned next
> time you queried. Then not modified again. Eventually you'd be in a
> position
> where you never changed any docs, just kept getting the same docList back
> over and over again. Marching through based on the unique key should not
> have the same potential issue.
>
> You should not be mixing the new query stuff with CURSORMARK. Deep paging
> supposes the exact same query is being run over and over and you're
> _paging_
> through the results. You're changing the query every time so the results
> aren't
> very predictable.
>
> Best,
> Erick
>
>
> On Sat, Sep 26, 2015 at 5:01 PM, Ravi Solr <ra...@gmail.com> wrote:
> > Erick & Shawn I incrporated your suggestions.
> >
> >
> > 0. Shut off all other indexing processes.
> > 1. As Shawn mentioned set batch size to 10000.
> > 2. Loved Erick's suggestion about not using filter at all and sort by
> > uniqueId and put last known uinqueId as next queries start while still
> > using cursor marks as follows
> >
> > SolrQuery q = new SolrQuery("+uuid:sun.org.mozilla* +uniqueId:{" +
> > markerSysId + " TO
> > *]").setRows(10000).addSort("uniqueId",ORDER.asc).setFields(new
> > String[]{"uniqueId","uuid"});
> > q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
> >
> > 3. As per Shawn's advise commented autocommit and soft commit in
> > solrconfig.xml and set openSearcher to false and issued MANUAL COMMIT for
> > every batch from code as follows
> >
> > client.commit(true, true, true);
> >
> > Here is what the log statement & results - log.info("Indexed " + count +
> > "/" + docList.getNumFound());
> >
> >
> > 2015-09-26 17:29:57 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 90000/1344085
> > 2015-09-26 17:30:30 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 100000/1334085
> > 2015-09-26 17:33:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 110000/1324085
> > 2015-09-26 17:36:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 120000/1314085
> > 2015-09-26 17:39:42 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 130000/1304085
> > 2015-09-26 17:43:05 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 140000/1294085
> > 2015-09-26 17:46:14 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 150000/1284085
> > 2015-09-26 17:48:22 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 160000/1274085
> > 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/0
> > 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
> >
> > Ran manually a second time to see if first was fluke. Still same.
> >
> > 2015-09-26 17:55:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 10000/1264716
> > 2015-09-26 17:58:07 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 20000/1254716
> > 2015-09-26 18:03:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 30000/1244716
> > 2015-09-26 18:06:32 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 40000/1234716
> > 2015-09-26 18:10:35 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 50000/1224716
> > 2015-09-26 18:15:23 INFO  [a.b.c.AdhocCorrectUUID] - Indexed
> 60000/1214716
> > 2015-09-26 18:15:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/0
> > 2015-09-26 18:15:26 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
> >
> > Now changed the autommit in solrconfig.xml as follows...Note the soft
> > commit has been shut off as per Shawn's advise
> >
> >     <autoCommit>
> >        <!-- <maxDocs>100</maxDocs> -->
> >        <maxTime>300000</maxTime>
> >      <openSearcher>false</openSearcher>
> >     </autoCommit>
> >
> >   <!--
> >     <autoSoftCommit>
> >         <maxTime>30000</maxTime>
> >     </autoSoftCommit>
> >   -->
> >
> > 2015-09-26 18:47:44 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > Indexed 10000/1205451
> > 2015-09-26 18:50:49 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > Indexed 20000/1195451
> > 2015-09-26 18:54:18 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > Indexed 30000/1185451
> > 2015-09-26 18:57:04 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > Indexed 40000/1175451
> > 2015-09-26 19:00:10 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > Indexed 50000/1165451
> > 2015-09-26 19:00:13 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > Indexed 50000/0
> > 2015-09-26 19:00:13 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID]
> -
> > FINISHED !!!
> >
> >
> > The query still returned 0 results when they are over million docs
> > available which match uuid:sun.org.mozilla* ...Then why do I get 0 ???
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Sat, Sep 26, 2015 at 3:49 PM, Ravi Solr <ra...@gmail.com> wrote:
> >
> >> Thank you Erick & Shawn for taking significant time off your weekends to
> >> debug and explain in great detail. I will try to address the main points
> >> from your emails to provide more situation context for better
> understanding
> >> of my situation
> >>
> >> 1. Erick, As part of our upgrade from 4.7.2 to 5.3.0 I re-indexed all
> docs
> >> from my old Master-Slave to My SolrCloud using DIH SolrEntityProcessor
> >> which used a Script Transformer. I unwittingly messed up the script and
> >> hence this 'uuid' (String Type field) got messed up. All records prior
> to
> >> Sep 20 2015 have this issue that I am currently try to rectify.
> >>
> >> 2. Regarding openSearcher=true/false, I had it as false all along in my
> >> 4.7.2 config. I read somewhere that SolrCloud or 5.x doesn't honor it
> or it
> >> should be left default (Don't exactly remember where I read it), hence,
> I
> >> removed it from my solrconfig.xml going against my intuition :-)
> >>
> >> 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially
> using
> >> 100 docs batch, which, I later increased to 500 docs per batch. Also it
> >> would not be a infinite loop if I commit for each batch, right !!??
> >>
> >> 4. Shawn, you are correct the uuid is of String Type and its not unique
> >> key for my schema. My uniqueKey is uniqueId and systemid is of no
> >> consequence here, it's another field for differentiating apps within my
> >> solr.
> >>
> >> Than you very much again guys. I will incorporate your suggestions and
> >> report back.
> >>
> >> Thanks
> >>
> >> Ravi Kiran Bhaskar
> >>
> >> On Sat, Sep 26, 2015 at 12:58 PM, Erick Erickson <
> erickerickson@gmail.com>
> >> wrote:
> >>
> >>> Oh, one more thing. _assuming_ you can't change the indexing process
> >>> that gets the docs from the system of record, why not just add an
> >>> update processor that does this at index time? See:
> >>>
> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
> >>> ,
> >>> in particular the StatelessScriptUpdateProcessorFactory might be a
> >>> good candidate. It just takes a bit of javascript (or other scripting
> >>> language) and changes the record before it gets indexed.
> >>>
> >>> FWIW,
> >>> Erick
> >>>
> >>> On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <ap...@elyograg.org>
> >>> wrote:
> >>> > On 9/26/2015 10:41 AM, Shawn Heisey wrote:
> >>> >> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
> >>> >
> >>> > This needs to include openSearcher=false, as Erick mentioned.  I'm
> sorry
> >>> > I screwed that up:
> >>> >
> >>> >   <autoCommit>
> >>> >     <maxTime>300000</maxTime>
> >>> >     <openSearcher>false</openSearcher>
> >>> >   </autoCommit>
> >>> >
> >>> > Thanks,
> >>> > Shawn
> >>>
> >>
> >>
>

Re: bulk reindexing 5.3.0 issue

Posted by Erick Erickson <er...@gmail.com>.
bq: 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially using
100 docs batch, which, I later increased to 500 docs per batch. Also it
would not be a infinite loop if I commit for each batch, right !!??

That's not the point at all. Look at the basic logic here:

You run for a while processing 100 (or 500 or 1,000) docs per batch
and change all uuid fields with this statement:

uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");

and then update the doc. You run this as long as you have any docs
that satisfy the query "q=uuid:sun.org.mozilla*", _changing_
every one that has this string!

At that point, theoretically, no document in your index has this string. So
running your update program immediately after should find _zero_ documents.

I've been assuming your complaint is that you don't process 1.4 M docs (in
batches); you process some lower number, then exit, and you think this is wrong.
I'm claiming that you should only expect to find as many docs as have been
indexed since the last time the program ran.

As far as the infinite loop is concerned, again trace the logic in the old code.
Forget about commits and all the mechanics, just look at the logic.
You're querying on "sun.org.mozilla*", but you only change a doc if you get a match on
"sun.org.mozilla.javascript.internal.NativeString:"

Now imagine you have a doc that has sun.org.mozilla.erick in it. That doc gets
returned from the query but does _not_ get modified because it doesn't
match your pattern. In the older code, it would be found again and returned next
time you queried. Then not modified again. Eventually you'd be in a position
where you never changed any docs, just kept getting the same docList back
over and over again. Marching through based on the unique key should not
have the same potential issue.

You should not be mixing the new query stuff with CURSORMARK. Deep paging
supposes the exact same query is being run over and over and you're _paging_
through the results. You're changing the query every time so the results aren't
very predictable.
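
For reference, a bare-bones sketch of how cursorMark paging is meant to be used:
the query and sort stay fixed for the entire walk, and only the cursor value
changes between requests (the ZK host, collection, and field names here are
placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWalk {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181")) {   // placeholder ZK host
            client.setDefaultCollection("mycollection");                   // placeholder collection
            SolrQuery q = new SolrQuery("uuid:sun.org.mozilla*");          // the query never changes
            q.setRows(1000).setFields("uniqueId", "uuid");
            q.addSort("uniqueId", SolrQuery.ORDER.asc);                    // sort must include the uniqueKey
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);
                // ... process rsp.getResults() here ...
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) {
                    break;                                                 // cursor stopped advancing: done
                }
                cursor = next;
            }
        }
    }
}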

Best,
Erick


On Sat, Sep 26, 2015 at 5:01 PM, Ravi Solr <ra...@gmail.com> wrote:
> Erick & Shawn I incrporated your suggestions.
>
>
> 0. Shut off all other indexing processes.
> 1. As Shawn mentioned set batch size to 10000.
> 2. Loved Erick's suggestion about not using filter at all and sort by
> uniqueId and put last known uinqueId as next queries start while still
> using cursor marks as follows
>
> SolrQuery q = new SolrQuery("+uuid:sun.org.mozilla* +uniqueId:{" +
> markerSysId + " TO
> *]").setRows(10000).addSort("uniqueId",ORDER.asc).setFields(new
> String[]{"uniqueId","uuid"});
> q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>
> 3. As per Shawn's advise commented autocommit and soft commit in
> solrconfig.xml and set openSearcher to false and issued MANUAL COMMIT for
> every batch from code as follows
>
> client.commit(true, true, true);
>
> Here is what the log statement & results - log.info("Indexed " + count +
> "/" + docList.getNumFound());
>
>
> 2015-09-26 17:29:57 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 90000/1344085
> 2015-09-26 17:30:30 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 100000/1334085
> 2015-09-26 17:33:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 110000/1324085
> 2015-09-26 17:36:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 120000/1314085
> 2015-09-26 17:39:42 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 130000/1304085
> 2015-09-26 17:43:05 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 140000/1294085
> 2015-09-26 17:46:14 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 150000/1284085
> 2015-09-26 17:48:22 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/1274085
> 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/0
> 2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
> Ran manually a second time to see if first was fluke. Still same.
>
> 2015-09-26 17:55:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 10000/1264716
> 2015-09-26 17:58:07 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 20000/1254716
> 2015-09-26 18:03:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 30000/1244716
> 2015-09-26 18:06:32 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 40000/1234716
> 2015-09-26 18:10:35 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 50000/1224716
> 2015-09-26 18:15:23 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/1214716
> 2015-09-26 18:15:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/0
> 2015-09-26 18:15:26 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
> Now changed the autommit in solrconfig.xml as follows...Note the soft
> commit has been shut off as per Shawn's advise
>
>     <autoCommit>
>        <!-- <maxDocs>100</maxDocs> -->
>        <maxTime>300000</maxTime>
>      <openSearcher>false</openSearcher>
>     </autoCommit>
>
>   <!--
>     <autoSoftCommit>
>         <maxTime>30000</maxTime>
>     </autoSoftCommit>
>   -->
>
> 2015-09-26 18:47:44 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> Indexed 10000/1205451
> 2015-09-26 18:50:49 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> Indexed 20000/1195451
> 2015-09-26 18:54:18 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> Indexed 30000/1185451
> 2015-09-26 18:57:04 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> Indexed 40000/1175451
> 2015-09-26 19:00:10 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> Indexed 50000/1165451
> 2015-09-26 19:00:13 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> Indexed 50000/0
> 2015-09-26 19:00:13 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
> FINISHED !!!
>
>
> The query still returned 0 results when they are over million docs
> available which match uuid:sun.org.mozilla* ...Then why do I get 0 ???
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Sat, Sep 26, 2015 at 3:49 PM, Ravi Solr <ra...@gmail.com> wrote:
>
>> Thank you Erick & Shawn for taking significant time off your weekends to
>> debug and explain in great detail. I will try to address the main points
>> from your emails to provide more situation context for better understanding
>> of my situation
>>
>> 1. Erick, As part of our upgrade from 4.7.2 to 5.3.0 I re-indexed all docs
>> from my old Master-Slave to My SolrCloud using DIH SolrEntityProcessor
>> which used a Script Transformer. I unwittingly messed up the script and
>> hence this 'uuid' (String Type field) got messed up. All records prior to
>> Sep 20 2015 have this issue that I am currently try to rectify.
>>
>> 2. Regarding openSearcher=true/false, I had it as false all along in my
>> 4.7.2 config. I read somewhere that SolrCloud or 5.x doesn't honor it or it
>> should be left default (Don't exactly remember where I read it), hence, I
>> removed it from my solrconfig.xml going against my intuition :-)
>>
>> 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially using
>> 100 docs batch, which, I later increased to 500 docs per batch. Also it
>> would not be a infinite loop if I commit for each batch, right !!??
>>
>> 4. Shawn, you are correct the uuid is of String Type and its not unique
>> key for my schema. My uniqueKey is uniqueId and systemid is of no
>> consequence here, it's another field for differentiating apps within my
>> solr.
>>
>> Than you very much again guys. I will incorporate your suggestions and
>> report back.
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Sat, Sep 26, 2015 at 12:58 PM, Erick Erickson <er...@gmail.com>
>> wrote:
>>
>>> Oh, one more thing. _assuming_ you can't change the indexing process
>>> that gets the docs from the system of record, why not just add an
>>> update processor that does this at index time? See:
>>> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
>>> ,
>>> in particular the StatelessScriptUpdateProcessorFactory might be a
>>> good candidate. It just takes a bit of javascript (or other scripting
>>> language) and changes the record before it gets indexed.
>>>
>>> FWIW,
>>> Erick
>>>
>>> On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <ap...@elyograg.org>
>>> wrote:
>>> > On 9/26/2015 10:41 AM, Shawn Heisey wrote:
>>> >> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
>>> >
>>> > This needs to include openSearcher=false, as Erick mentioned.  I'm sorry
>>> > I screwed that up:
>>> >
>>> >   <autoCommit>
>>> >     <maxTime>300000</maxTime>
>>> >     <openSearcher>false</openSearcher>
>>> >   </autoCommit>
>>> >
>>> > Thanks,
>>> > Shawn
>>>
>>
>>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Erick & Shawn I incrporated your suggestions.


0. Shut off all other indexing processes.
1. As Shawn mentioned, set the batch size to 10000.
2. Loved Erick's suggestion about not using a filter at all: sort by
uniqueId and use the last known uniqueId as the next query's start, while still
using cursor marks, as follows:

SolrQuery q = new SolrQuery("+uuid:sun.org.mozilla* +uniqueId:{" + markerSysId + " TO *]")
        .setRows(10000)
        .addSort("uniqueId", ORDER.asc)
        .setFields(new String[]{"uniqueId", "uuid"});
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);

3. As per Shawn's advice, commented out autocommit and soft commit in
solrconfig.xml, set openSearcher to false, and issued a MANUAL COMMIT for
every batch from the code, as follows:

client.commit(true, true, true);

Here is the log statement and its results: log.info("Indexed " + count + "/" + docList.getNumFound());


2015-09-26 17:29:57 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 90000/1344085
2015-09-26 17:30:30 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 100000/1334085
2015-09-26 17:33:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 110000/1324085
2015-09-26 17:36:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 120000/1314085
2015-09-26 17:39:42 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 130000/1304085
2015-09-26 17:43:05 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 140000/1294085
2015-09-26 17:46:14 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 150000/1284085
2015-09-26 17:48:22 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/1274085
2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 160000/0
2015-09-26 17:48:25 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!

Ran it manually a second time to see if the first run was a fluke. Still the same.

2015-09-26 17:55:26 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 10000/1264716
2015-09-26 17:58:07 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 20000/1254716
2015-09-26 18:03:09 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 30000/1244716
2015-09-26 18:06:32 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 40000/1234716
2015-09-26 18:10:35 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 50000/1224716
2015-09-26 18:15:23 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/1214716
2015-09-26 18:15:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 60000/0
2015-09-26 18:15:26 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!

Now changed the autocommit in solrconfig.xml as follows... Note the soft
commit has been shut off as per Shawn's advice:

    <autoCommit>
       <!-- <maxDocs>100</maxDocs> -->
       <maxTime>300000</maxTime>
     <openSearcher>false</openSearcher>
    </autoCommit>

  <!--
    <autoSoftCommit>
        <maxTime>30000</maxTime>
    </autoSoftCommit>
  -->

2015-09-26 18:47:44 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
Indexed 10000/1205451
2015-09-26 18:50:49 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
Indexed 20000/1195451
2015-09-26 18:54:18 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
Indexed 30000/1185451
2015-09-26 18:57:04 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
Indexed 40000/1175451
2015-09-26 19:00:10 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
Indexed 50000/1165451
2015-09-26 19:00:13 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
Indexed 50000/0
2015-09-26 19:00:13 INFO  [com.wpost.search.reindexing.AdhocCorrectUUID] -
FINISHED !!!


The query still returned 0 results when there are over a million docs
available which match uuid:sun.org.mozilla* ... Then why do I get 0???

Thanks

Ravi Kiran Bhaskar

On Sat, Sep 26, 2015 at 3:49 PM, Ravi Solr <ra...@gmail.com> wrote:

> Thank you Erick & Shawn for taking significant time off your weekends to
> debug and explain in great detail. I will try to address the main points
> from your emails to provide more situation context for better understanding
> of my situation
>
> 1. Erick, As part of our upgrade from 4.7.2 to 5.3.0 I re-indexed all docs
> from my old Master-Slave to My SolrCloud using DIH SolrEntityProcessor
> which used a Script Transformer. I unwittingly messed up the script and
> hence this 'uuid' (String Type field) got messed up. All records prior to
> Sep 20 2015 have this issue that I am currently try to rectify.
>
> 2. Regarding openSearcher=true/false, I had it as false all along in my
> 4.7.2 config. I read somewhere that SolrCloud or 5.x doesn't honor it or it
> should be left default (Don't exactly remember where I read it), hence, I
> removed it from my solrconfig.xml going against my intuition :-)
>
> 3. Erick, I wasnt getting all 1.4 mill in one shot. I was initially using
> 100 docs batch, which, I later increased to 500 docs per batch. Also it
> would not be a infinite loop if I commit for each batch, right !!??
>
> 4. Shawn, you are correct the uuid is of String Type and its not unique
> key for my schema. My uniqueKey is uniqueId and systemid is of no
> consequence here, it's another field for differentiating apps within my
> solr.
>
> Than you very much again guys. I will incorporate your suggestions and
> report back.
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Sat, Sep 26, 2015 at 12:58 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Oh, one more thing. _assuming_ you can't change the indexing process
>> that gets the docs from the system of record, why not just add an
>> update processor that does this at index time? See:
>> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
>> ,
>> in particular the StatelessScriptUpdateProcessorFactory might be a
>> good candidate. It just takes a bit of javascript (or other scripting
>> language) and changes the record before it gets indexed.
>>
>> FWIW,
>> Erick
>>
>> On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <ap...@elyograg.org>
>> wrote:
>> > On 9/26/2015 10:41 AM, Shawn Heisey wrote:
>> >> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
>> >
>> > This needs to include openSearcher=false, as Erick mentioned.  I'm sorry
>> > I screwed that up:
>> >
>> >   <autoCommit>
>> >     <maxTime>300000</maxTime>
>> >     <openSearcher>false</openSearcher>
>> >   </autoCommit>
>> >
>> > Thanks,
>> > Shawn
>>
>
>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Thank you Erick & Shawn for taking significant time off your weekends to
debug and explain in great detail. I will try to address the main points
from your emails to provide more situation context for better understanding
of my situation

1. Erick, as part of our upgrade from 4.7.2 to 5.3.0 I re-indexed all docs
from my old master-slave setup to my SolrCloud using the DIH SolrEntityProcessor,
which used a ScriptTransformer. I unwittingly messed up the script, and
hence this 'uuid' (a string-type field) got messed up. All records prior to
Sep 20, 2015 have this issue, which I am currently trying to rectify.

2. Regarding openSearcher=true/false, I had it as false all along in my
4.7.2 config. I read somewhere that SolrCloud or 5.x doesn't honor it, or that it
should be left at the default (I don't exactly remember where I read it), hence I
removed it from my solrconfig.xml, going against my intuition :-)

3. Erick, I wasn't getting all 1.4 million in one shot. I was initially using
100-doc batches, which I later increased to 500 docs per batch. Also, it
would not be an infinite loop if I commit for each batch, right!!??

4. Shawn, you are correct: the uuid is of string type and it is not the unique key
for my schema. My uniqueKey is uniqueId, and systemid is of no consequence
here; it's another field for differentiating apps within my Solr.

Thank you very much again, guys. I will incorporate your suggestions and
report back.

Thanks

Ravi Kiran Bhaskar

On Sat, Sep 26, 2015 at 12:58 PM, Erick Erickson <er...@gmail.com>
wrote:

> Oh, one more thing. _assuming_ you can't change the indexing process
> that gets the docs from the system of record, why not just add an
> update processor that does this at index time? See:
> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
> ,
> in particular the StatelessScriptUpdateProcessorFactory might be a
> good candidate. It just takes a bit of javascript (or other scripting
> language) and changes the record before it gets indexed.
>
> FWIW,
> Erick
>
> On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> > On 9/26/2015 10:41 AM, Shawn Heisey wrote:
> >> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
> >
> > This needs to include openSearcher=false, as Erick mentioned.  I'm sorry
> > I screwed that up:
> >
> >   <autoCommit>
> >     <maxTime>300000</maxTime>
> >     <openSearcher>false</openSearcher>
> >   </autoCommit>
> >
> > Thanks,
> > Shawn
>

Re: bulk reindexing 5.3.0 issue

Posted by Erick Erickson <er...@gmail.com>.
Oh, one more thing. _Assuming_ you can't change the indexing process
that gets the docs from the system of record, why not just add an
update processor that does this at index time? See
https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors;
in particular, the StatelessScriptUpdateProcessorFactory might be a
good candidate. It just takes a bit of JavaScript (or another scripting
language) and changes the record before it gets indexed.
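
A rough sketch of what that could look like (the chain name, script file name,
and field name are placeholders; the chain would be referenced from the indexing
request with update.chain=fix-uuid). In solrconfig.xml:

    <updateRequestProcessorChain name="fix-uuid">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">fix-uuid.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

and a fix-uuid.js in the conf directory along these lines:

    function processAdd(cmd) {
      var doc = cmd.solrDoc;                      // the incoming SolrInputDocument
      var uuid = doc.getFieldValue("uuid");       // "uuid" is the field from this thread
      if (uuid != null) {
        doc.setField("uuid",
            uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", ""));
      }
    }
    // optional no-op hooks for the other update events
    function processDelete(cmd) { }
    function processCommit(cmd) { }
    function finish() { }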

FWIW,
Erick

On Sat, Sep 26, 2015 at 9:52 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 9/26/2015 10:41 AM, Shawn Heisey wrote:
>> <autoCommit> <maxTime>300000</maxTime> </autoCommit>
>
> This needs to include openSearcher=false, as Erick mentioned.  I'm sorry
> I screwed that up:
>
>   <autoCommit>
>     <maxTime>300000</maxTime>
>     <openSearcher>false</openSearcher>
>   </autoCommit>
>
> Thanks,
> Shawn

Re: bulk reindexing 5.3.0 issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2015 10:41 AM, Shawn Heisey wrote:
> <autoCommit> <maxTime>300000</maxTime> </autoCommit>

This needs to include openSearcher=false, as Erick mentioned.  I'm sorry
I screwed that up:

  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

Thanks,
Shawn

Re: bulk reindexing 5.3.0 issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/25/2015 10:10 PM, Ravi Solr wrote:
> thank you for taking time to help me out. Yes I was not using cursorMark, I
> will try that next. This is what I was doing, its a bit shabby coding but
> what can I say my brain was fried :-) FYI this is a side process just to
> correct a messed up string. The actual indexing process was working all the
> time as our business owners are a bit petulant about stopping indexing. My
> autocommit conf and code is given below, as you can see autocommit should
> fire every 100 docs anyway

It took a while, but I finally managed to see how this would page
through the docs.  You are filtering on the text that you are removing.
 This would indeed require that the previous changes are committed
before going through the loop again.  Switching to cursorMark is
probably not necessary, if you optimize your query and your commits.

My advice incorporates some of what Erick said, and some ideas of my own:

I think you should remove autoSoftCommit, and set autoCommit to a
maxTime of 300000 (five minutes) and do not include maxDocs.

    <autoCommit>
       <maxTime>300000</maxTime>
    </autoCommit>

Remove the 5 second sleep from the code.  I would also increase the
number of documents for each loop beyond 100 ... to a minimum of 1000,
possibly more like 10000.  The call to getDocs inside the loop should
not use the size of the previous result, it should use the number of
docs you define for the loop.  After the "add" call in your processDocs
method, you should send a soft commit, so the code looks like this:

  client.add(inList);
  client.commit(true, true, true);

The autoCommit will ensure your transaction log never gets very large,
and the soft commit in your code will take care of change visibility as
quickly as possible.  You might find that some loops take longer than
five seconds, but it should work.

You need to remove the "uuid:[* TO *]" filter.  This is doing
unnecessary (and fairly slow) work on the server side -- the other
filter will ensure that the results would match the range filter, so the
range filter is not necessary.  I assume that you have tried out the
query manually so that you know it actually works?

I'm guessing that uuid is a StrField, not an actual UUID type.  I'm
reasonably certain that if it were a UUID type, it would not have
accepted the class name that you are trying to remove.

What is your uniqueKey field?  I hope it's not uuid.  I think that you
would not get the results you want if that were the case.  Your code
excerpt hints that the uniqueKey is another field.

I pulled your code into a new Eclipse project and made the recommended
changes, plus a few other very small modifications.  The results are here:

http://apaste.info/w48

I had no context for the "systemid" variable, so I defined it to get rid
of the compiler error.  It is only used for logging.  I also had to
define the "log" variable to get the code to validate, which I think
you've already done in your own class, so that can be removed from my
workup.  The code is formatted to my company's standard formatting,
which probably doesn't match your own standard.

Something I just noticed:  You could probably remove the sort from the
query, which might reduce the amount of memory used on the Solr server
and make everything generally faster.
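
In sketch form, the whole loop with those changes would look roughly like
this (field and collection names are taken from the earlier code, SolrJ 5.x
signatures are assumed, and it is not the exact workup in the paste):

    private static void processDocs(CloudSolrClient client) throws Exception {
        final int batchSize = 1000;
        SolrDocumentList docList = getDocs(client, batchSize);
        while (docList != null && !docList.isEmpty()) {
            List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : docList) {
                SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
                String uuid = (String) iDoc.getFieldValue("uuid");
                Map<String, String> fieldModifier = new HashMap<String, String>(1);
                fieldModifier.put("set",
                        uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", ""));
                iDoc.setField("uuid", fieldModifier);   // atomic "set" update
                inList.add(iDoc);
            }
            client.add(inList);
            client.commit(true, true, true);            // soft commit, wait for the new searcher
            docList = getDocs(client, batchSize);       // always request the full batch size
        }
    }

    private static SolrDocumentList getDocs(CloudSolrClient client, int rows) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(rows);
        q.addFilterQuery("uuid:sun.org.mozilla*");      // the uuid:[* TO *] fq and the sort are gone
        q.setFields("uniqueId", "uuid");
        return client.query(q).getResults();
    }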

If the modified code runs into problems, there might be a serious issue
on the server side of your Solr install.

Thanks,
Shawn


Re: bulk reindexing 5.3.0 issue

Posted by Erick Erickson <er...@gmail.com>.
Well, let's forget the cursormark stuff for a bit.

There's no reason you should be getting all 1.4 million rows.
Presumably you've been running this program occasionally and blanking
strings like "sun.org.mozilla.javascript.internal.NativeString:" in
the uuid field. Then you turn around and run the program again with
the fq clause like

fq=uuid:sun.org.mozilla*

and there aren't very many, just the ones that have been added since
the last run. It's possible that your program is running perfectly
correctly. After it runs, have you run a query against the system to see if
there are any records that have a uuid that starts with
sun.org.mozilla? If not, everything's fine. And there's a potential
infinite loop in here as well. You're removing this string:
"sun.org.mozilla.javascript.internal.NativeString:" but searching on
anything like sun.org.mozilla. Let's claim you have lots of records like

uuid:sun.org.mozilla.anything.else

They'll never be replaced by the code you have and just be fetched
over and over and over again.
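
A tiny illustration with a hypothetical value (plain String.replace semantics):

    // Hypothetical uuid that matches the fq but not the full prefix being stripped:
    String uuid = "sun.org.mozilla.anything.else";
    // replace() finds no occurrence of the NativeString prefix, so the value comes
    // back unchanged and will still match uuid:sun.org.mozilla* on the next pass.
    String sanitized = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");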

BTW, having the maxDocs set to 100 is very, very, very short in any
kind of bulk indexing operation and will lead to a lot of churn.
Usually, people either set a hard commit (autoCommit) with openSearcher=false
or use softCommit. Using both is pretty odd. Here's a long blog on the
subject: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Relying on commits to have everything re-ordered so you can just get
the first N, even with a delay, is not very robust. I'd take the
special settings you appear to have in the solrconfig file out and
just let autocommit work as usual, and concentrate instead on forming a
query that does what you want in the first place.

So let me see if I understand what you're doing here:

Any uuid with "sun.org.mozilla.javascript.internal.NativeString:"
should have that bit removed, right? Instead of waiting for commits
and all that, why not just form queries like:

q=+uuid:sun.org.mozilla* +uniqueId:{marker TO *]&
sort=uniqueId asc&

where "marker" is * the first time you query, and the uniqueId from
the last record returned the previous time it ran?
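
A rough sketch of that loop in SolrJ (assuming uniqueId is the uniqueKey, a
plain string field, and that its values need no extra escaping inside the
range clause):

    String marker = "*";
    while (true) {
        SolrQuery q = new SolrQuery("+uuid:sun.org.mozilla* +uniqueId:{" + marker + " TO *]");
        q.setSort("uniqueId", ORDER.asc);
        q.setRows(1000);
        q.setFields("uniqueId", "uuid");
        SolrDocumentList docs = client.query(q).getResults();
        if (docs.isEmpty()) {
            break;                                      // nothing left to fix
        }
        // ... build and send the atomic "set" updates for this batch here ...
        marker = (String) docs.get(docs.size() - 1).getFieldValue("uniqueId");
    }
    client.commit(true, true);                          // one commit at the very end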

Then you wouldn't have to worry about commits, waiting, or anything
else. And your commit interval is beating the crap out of your system
by opening new searchers all the time when you're indexing.

Best,
Erick


On Fri, Sep 25, 2015 at 11:10 PM, Ravi Solr <ra...@gmail.com> wrote:
> Erick I fixed the "missing content stream" issue as well. by making sure
> Iam not adding empty list. However, My very first issue of getting zero
> docs once in a while is still haunting me, even after using cursorMarkers,
> disabling auto commit and soft commit. I ran code two times and you can see
> the statement returns zero docs at random times.
>
> log.info("Indexed " + count + "/" + docList.getNumFound());
>
> -bash-4.1$ tail -f reindexing.log
> 2015-09-26 01:44:40 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 6500/1440653
> 2015-09-26 01:44:44 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 7000/1439863
> 2015-09-26 01:44:48 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 7500/1439410
> 2015-09-26 01:44:56 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 8000/1438918
> 2015-09-26 01:45:01 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 8500/1438330
> 2015-09-26 01:45:01 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 8500/0
> 2015-09-26 01:45:06 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
> 2015-09-26 01:48:15 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 500/1437440
> 2015-09-26 01:48:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/1437440
> 2015-09-26 01:48:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/0
> 2015-09-26 01:48:22 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Sat, Sep 26, 2015 at 1:17 AM, Ravi Solr <ra...@gmail.com> wrote:
>
>> Erick as per your advise I used cursorMarks (see code below). It was
>> slightly better but Solr throws Exceptions randomly. Please look at the
>> code and Stacktrace below
>>
>> 2015-09-26 01:00:45 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133
>> 2015-09-26 01:00:49 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133
>> 2015-09-26 01:00:54 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592
>> 2015-09-26 01:00:58 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095
>> 2015-09-26 01:01:03 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675
>> 2015-09-26 01:01:10 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924
>> 2015-09-26 01:01:15 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445
>> 2015-09-26 01:01:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997
>> 2015-09-26 01:01:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692
>> 2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201
>> 2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://xx.xx.xx.xx:1111/solr/collection1: missing
>> content stream
>>     at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560)
>>     at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
>>     at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
>>     at
>> org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376)
>>     at
>> org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328)
>>     at
>> org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
>>     at
>> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856)
>>     at
>> org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799)
>>     at
>> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>>     at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97)
>>     at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at com.simontuffs.onejar.Boot.run(Boot.java:306)
>>     at com.simontuffs.onejar.Boot.main(Boot.java:159)
>> 2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>>
>>
>> CODE
>> ------------
>>     protected static void processDocs() {
>>
>>         try {
>>             CloudSolrClient client = new
>> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>>             client.setDefaultCollection("collection1");
>>
>>             boolean done = false;
>>             String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>>             Integer count = 0;
>>
>>             while (!done) {
>>                 SolrQuery q = new
>> SolrQuery("*:*").setRows(500).addSort("publishtime",
>> ORDER.desc).addSort("uniqueId",ORDER.desc).setFields(new
>> String[]{"uniqueId","uuid"});
>>                 q.addFilterQuery(new String[] {"uuid:[* TO *]",
>> "uuid:sun.org.mozilla*"});
>>                 q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>>
>>                 QueryResponse resp = client.query(q);
>>                 String nextCursorMark = resp.getNextCursorMark();
>>
>>                 SolrDocumentList docList = resp.getResults();
>>
>>                 List<SolrInputDocument> inList = new
>> ArrayList<SolrInputDocument>();
>>                 for(SolrDocument doc : docList) {
>>
>>                     SolrInputDocument iDoc =
>> ClientUtils.toSolrInputDocument(doc);
>>
>>                     //This is my system's id
>>                     String uniqueId = (String)
>> iDoc.getFieldValue("uniqueId");
>>
>>                     /*
>>                      * This is another system's unique id which is what I
>> want to correct that was messed
>>                      * because of script transformer in DIH import via
>> SolrEntityProcessor
>>                      * ex-
>> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>>                      */
>>                     String uuid = (String) iDoc.getFieldValue("uuid");
>>                     String sanitizedUUID =
>> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>>                     Map<String,String> fieldModifier = new
>> HashMap<String,String>(1);
>>                     fieldModifier.put("set",sanitizedUUID);
>>                     iDoc.setField("uuid", fieldModifier);
>>
>>                     inList.add(iDoc);
>>                 }
>>                 client.add(inList);
>>
>>                 count = count + docList.size();
>>                 log.info("Indexed " + count + "/" +
>> docList.getNumFound());
>>
>>                 if (cursorMark.equals(nextCursorMark)) {
>>                     done = true;
>>                     client.commit(true, true);
>>                 }
>>                 cursorMark = nextCursorMark;
>>             }
>>
>>         } catch (Exception e) {
>>             log.error("Error indexing ", e);
>>         }
>>     }
>>
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Sat, Sep 26, 2015 at 12:10 AM, Ravi Solr <ra...@gmail.com> wrote:
>>
>>> thank you for taking time to help me out. Yes I was not using cursorMark,
>>> I will try that next. This is what I was doing, its a bit shabby coding but
>>> what can I say my brain was fried :-) FYI this is a side process just to
>>> correct a messed up string. The actual indexing process was working all the
>>> time as our business owners are a bit petulant about stopping indexing. My
>>> autocommit conf and code is given below, as you can see autocommit should
>>> fire every 100 docs anyway
>>>
>>>     <autoCommit>
>>>        <maxDocs>100</maxDocs>
>>>        <maxTime>120000</maxTime>
>>>     </autoCommit>
>>>
>>>     <autoSoftCommit>
>>>         <maxTime>30000</maxTime>
>>>     </autoSoftCommit>
>>>   </updateHandler>
>>>
>>>     private static void processDocs() {
>>>
>>>         try {
>>>             CloudSolrClient client = new
>>> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>>>             client.setDefaultCollection("collection1");
>>>
>>>             //First initialize docs
>>>             SolrDocumentList docList = getDocs(client, 100);
>>>             Long count = 0L;
>>>
>>>             while (docList != null && docList.size() > 0) {
>>>
>>>                 List<SolrInputDocument> inList = new
>>> ArrayList<SolrInputDocument>();
>>>                 for(SolrDocument doc : docList) {
>>>
>>>                     SolrInputDocument iDoc =
>>> ClientUtils.toSolrInputDocument(doc);
>>>
>>>                     //This is my SOLR's Unique id
>>>                     String uniqueId = (String)
>>> iDoc.getFieldValue("uniqueId");
>>>
>>>                     /*
>>>                      * This is another system's id which is what I want
>>> to correct. Was messed
>>>                      * because of script transformer in DIH import via
>>> SolrEntityProcessor
>>>                      * ex-
>>> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>>>                      */
>>>                     String uuid = (String) iDoc.getFieldValue("uuid");
>>>                     String sanitizedUUID =
>>> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>>>                     Map<String,String> fieldModifier = new
>>> HashMap<String,String>(1);
>>>                     fieldModifier.put("set",sanitizedUUID);
>>>                     iDoc.setField("uuid", fieldModifier);
>>>
>>>                     inList.add(iDoc);
>>>                     log.info("added " + systemid);
>>>                 }
>>>
>>>                 client.add(inList);
>>>
>>>                 count = count + docList.size();
>>>                 log.info("Indexed " + count + "/" +
>>> docList.getNumFound());
>>>
>>>                 Thread.sleep(5000);
>>>
>>>                 docList = getDocs(client, docList.size());
>>>                 log.info("Got Docs- " + docList.getNumFound());
>>>             }
>>>
>>>         } catch (Exception e) {
>>>             log.error("Error indexing ", e);
>>>         }
>>>     }
>>>
>>>     private static SolrDocumentList getDocs(CloudSolrClient client,
>>> Integer rows) {
>>>
>>>
>>>         SolrQuery q = new SolrQuery("*:*");
>>>         q.setSort("publishtime", ORDER.desc);
>>>         q.setStart(0);
>>>         q.setRows(rows);
>>>         q.addFilterQuery(new String[] {"uuid:[* TO *]",
>>> "uuid:sun.org.mozilla*"});
>>>         q.setFields(new String[]{"uniqueId","uuid"});
>>>         SolrDocumentList docList = null;
>>>         QueryResponse resp;
>>>         try {
>>>             resp = client.query(q);
>>>             docList = resp.getResults();
>>>         } catch (Exception e) {
>>>             log.error("Error querying " + q.toString(), e);
>>>         }
>>>         return docList;
>>>     }
>>>
>>>
>>> Thanks
>>>
>>> Ravi Kiran Bhaskar
>>>
>>> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerickson@gmail.com
>>> > wrote:
>>>
>>>> Wait, query again how? You've got to have something that keeps you
>>>> from getting the same 100 docs back so you have to be sorting somehow.
>>>> Or you have a high water mark. Or something. Waiting 5 seconds for any
>>>> commit also doesn't really make sense to me. I mean how do you know
>>>>
>>>> 1> that you're going to get a commit (did you explicitly send one from
>>>> the client?).
>>>> 2> all autowarming will be complete by the time the next query hits?
>>>>
>>>> Let's see the query you fire. There has to be some kind of marker that
>>>> you're using to know when you've gotten through the entire set.
>>>>
>>>> And I would use much larger batches, I usually update in batches of
>>>> 1,000 (excepting if these are very large docs of course). I suspect
>>>> you're spending a lot more time sleeping than you need to. I wouldn't
>>>> sleep at all in fact. This is one (rare) case I might consider
>>>> committing from the client. If you specify the wait for searcher param
>>>> (server.commit(true, true), then it doesn't return until a new
>>>> searcher is completely opened so your previous updates will be
>>>> reflected in your next search.
>>>>
>>>> Actually, what I'd really do is
>>>> 1> turn off all auto commits
>>>> 2> go ahead and query/change/update. But the query bits would be using
>>>> the cursormark.
>>>> 3> do NOT commit
>>>> 4> issue a commit when you were all done.
>>>>
>>>> I bet you'd get through your update a lot faster that way.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ra...@gmail.com> wrote:
>>>> > Thanks for responding Erick. I set the "start" to zero and "rows"
>>>> always to
>>>> > 100. I create CloudSolrClient instance and use it to both query as
>>>> well as
>>>> > index. But I do sleep for 5 secs just to allow for any auto commits.
>>>> >
>>>> > So query --> client.add(100 docs) --> wait --> query again
>>>> >
>>>> > But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
>>>> > docs the "query again" returns zero docs causing my while loop to
>>>> > exist...so was trying to see if I was doing the right thing or if
>>>> there is
>>>> > an alternate way to do heavy indexing.
>>>> >
>>>> > Thanks
>>>> >
>>>> > Ravi Kiran Bhaskar
>>>> >
>>>> >
>>>> >
>>>> > On Friday, September 25, 2015, Erick Erickson <erickerickson@gmail.com
>>>> >
>>>> > wrote:
>>>> >
>>>> >> How are you querying Solr? You say you query for 100 docs,
>>>> >> update then get the next set. What are you using for a marker?
>>>> >> If you're using the start parameter, and somehow a commit is
>>>> >> creeping in things might be weird, especially if you're using any
>>>> >> of the internal Lucene doc IDs. If you're absolutely sure no commits
>>>> >> are taking place even that should be OK.
>>>> >>
>>>> >> The "deep paging" stuff could be helpful here, see:
>>>> >>
>>>> >>
>>>> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>> >>
>>>> >> Best,
>>>> >> Erick
>>>> >>
>>>> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravisolr@gmail.com
>>>> >> <javascript:;>> wrote:
>>>> >> > No problem Walter, it's all fun. Was just wondering if there was
>>>> some
>>>> >> other
>>>> >> > good way that I did not know of, that's all 😀
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >> > Ravi Kiran Bhaskar
>>>> >> >
>>>> >> > On Friday, September 25, 2015, Walter Underwood <
>>>> wunder@wunderwood.org
>>>> >> <javascript:;>>
>>>> >> > wrote:
>>>> >> >
>>>> >> >> Sorry, I did not mean to be rude. The original question did not
>>>> say that
>>>> >> >> you don’t have the docs outside of Solr. Some people jump to the
>>>> >> advanced
>>>> >> >> features and miss the simple ones.
>>>> >> >>
>>>> >> >> It might be faster to fetch all the docs from Solr and save them in
>>>> >> files.
>>>> >> >> Then modify them. Then reload all of them. No guarantee, but it is
>>>> >> worth a
>>>> >> >> try.
>>>> >> >>
>>>> >> >> Good luck.
>>>> >> >>
>>>> >> >> wunder
>>>> >> >> Walter Underwood
>>>> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>>>> >> >> http://observer.wunderwood.org/  (my blog)
>>>> >> >>
>>>> >> >>
>>>> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
>>>> >> <javascript:;>
>>>> >> >> <javascript:;>> wrote:
>>>> >> >> >
>>>> >> >> > Walter, Not in a mood for banter right now.... Its 6:00pm on a
>>>> friday
>>>> >> and
>>>> >> >> > Iam stuck here trying to figure reindexing issues :-)
>>>> >> >> > I dont have source of docs so I have to query the SOLR, modify
>>>> and
>>>> >> put it
>>>> >> >> > back and that is seeming to be quite a task in 5.3.0, I did
>>>> reindex
>>>> >> >> several
>>>> >> >> > times with 4.7.2 in a master slave env without any issue. Since
>>>> then
>>>> >> we
>>>> >> >> > have moved to cloud and it has been a pain all day.
>>>> >> >> >
>>>> >> >> > Thanks
>>>> >> >> >
>>>> >> >> > Ravi Kiran Bhaskar
>>>> >> >> >
>>>> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
>>>> >> wunder@wunderwood.org <javascript:;>
>>>> >> >> <javascript:;>>
>>>> >> >> > wrote:
>>>> >> >> >
>>>> >> >> >> Sure.
>>>> >> >> >>
>>>> >> >> >> 1. Delete all the docs (no commit).
>>>> >> >> >> 2. Add all the docs (no commit).
>>>> >> >> >> 3. Commit.
>>>> >> >> >>
>>>> >> >> >> wunder
>>>> >> >> >> Walter Underwood
>>>> >> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>>>> >> >> >> http://observer.wunderwood.org/  (my blog)
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
>>>> >> <javascript:;>
>>>> >> >> <javascript:;>> wrote:
>>>> >> >> >>>
>>>> >> >> >>> I have been trying to re-index the docs (about 1.5 million) as
>>>> one
>>>> >> of
>>>> >> >> the
>>>> >> >> >>> field needed part of string value removed (accidentally
>>>> >> introduced). I
>>>> >> >> >> was
>>>> >> >> >>> issuing a query for 100 docs getting 4 fields and updating the
>>>> doc
>>>> >> >> >> (atomic
>>>> >> >> >>> update with "set") via the CloudSolrClient in batches, However
>>>> from
>>>> >> >> time
>>>> >> >> >> to
>>>> >> >> >>> time the query returns 0 results, which exits the re-indexing
>>>> >> program.
>>>> >> >> >>>
>>>> >> >> >>> I cant understand as to why the cloud returns 0 results when
>>>> there
>>>> >> are
>>>> >> >> >> 1.4x
>>>> >> >> >>> million docs which have the "accidental" string in them.
>>>> >> >> >>>
>>>> >> >> >>> Is there another way to do bulk massive updates ?
>>>> >> >> >>>
>>>> >> >> >>> Thanks
>>>> >> >> >>>
>>>> >> >> >>> Ravi Kiran Bhaskar
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >>
>>>> >> >>
>>>> >>
>>>>
>>>
>>>
>>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Erick, I fixed the "missing content stream" issue as well, by making sure
I am not adding an empty list (see the guard sketched after the log output
below). However, my very first issue of getting zero docs once in a while is
still haunting me, even after using cursorMarks and disabling auto commit and
soft commit. I ran the code two times and you can see the statement returns
zero docs at random times.

log.info("Indexed " + count + "/" + docList.getNumFound());

-bash-4.1$ tail -f reindexing.log
2015-09-26 01:44:40 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 6500/1440653
2015-09-26 01:44:44 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 7000/1439863
2015-09-26 01:44:48 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 7500/1439410
2015-09-26 01:44:56 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 8000/1438918
2015-09-26 01:45:01 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 8500/1438330
2015-09-26 01:45:01 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 8500/0
2015-09-26 01:45:06 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!

2015-09-26 01:48:15 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 500/1437440
2015-09-26 01:48:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/1437440
2015-09-26 01:48:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/0
2015-09-26 01:48:22 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
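
(For reference, the empty-list fix amounts to a guard like this around the add
in the cursorMark loop quoted below -- a sketch, not the exact change:)

    if (!inList.isEmpty()) {     // never send an empty update request,
        client.add(inList);      // which is what triggers "missing content stream"
    }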


Thanks

Ravi Kiran Bhaskar

On Sat, Sep 26, 2015 at 1:17 AM, Ravi Solr <ra...@gmail.com> wrote:

> Erick as per your advise I used cursorMarks (see code below). It was
> slightly better but Solr throws Exceptions randomly. Please look at the
> code and Stacktrace below
>
> 2015-09-26 01:00:45 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133
> 2015-09-26 01:00:49 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133
> 2015-09-26 01:00:54 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592
> 2015-09-26 01:00:58 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095
> 2015-09-26 01:01:03 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675
> 2015-09-26 01:01:10 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924
> 2015-09-26 01:01:15 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445
> 2015-09-26 01:01:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997
> 2015-09-26 01:01:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692
> 2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201
> 2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://xx.xx.xx.xx:1111/solr/collection1: missing
> content stream
>     at
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560)
>     at
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
>     at
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
>     at
> org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376)
>     at
> org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328)
>     at
> org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
>     at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856)
>     at
> org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799)
>     at
> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>     at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97)
>     at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at com.simontuffs.onejar.Boot.run(Boot.java:306)
>     at com.simontuffs.onejar.Boot.main(Boot.java:159)
> 2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
>
> CODE
> ------------
>     protected static void processDocs() {
>
>         try {
>             CloudSolrClient client = new
> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>             client.setDefaultCollection("collection1");
>
>             boolean done = false;
>             String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>             Integer count = 0;
>
>             while (!done) {
>                 SolrQuery q = new
> SolrQuery("*:*").setRows(500).addSort("publishtime",
> ORDER.desc).addSort("uniqueId",ORDER.desc).setFields(new
> String[]{"uniqueId","uuid"});
>                 q.addFilterQuery(new String[] {"uuid:[* TO *]",
> "uuid:sun.org.mozilla*"});
>                 q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>
>                 QueryResponse resp = client.query(q);
>                 String nextCursorMark = resp.getNextCursorMark();
>
>                 SolrDocumentList docList = resp.getResults();
>
>                 List<SolrInputDocument> inList = new
> ArrayList<SolrInputDocument>();
>                 for(SolrDocument doc : docList) {
>
>                     SolrInputDocument iDoc =
> ClientUtils.toSolrInputDocument(doc);
>
>                     //This is my system's id
>                     String uniqueId = (String)
> iDoc.getFieldValue("uniqueId");
>
>                     /*
>                      * This is another system's unique id which is what I
> want to correct that was messed
>                      * because of script transformer in DIH import via
> SolrEntityProcessor
>                      * ex-
> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>                      */
>                     String uuid = (String) iDoc.getFieldValue("uuid");
>                     String sanitizedUUID =
> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>                     Map<String,String> fieldModifier = new
> HashMap<String,String>(1);
>                     fieldModifier.put("set",sanitizedUUID);
>                     iDoc.setField("uuid", fieldModifier);
>
>                     inList.add(iDoc);
>                 }
>                 client.add(inList);
>
>                 count = count + docList.size();
>                 log.info("Indexed " + count + "/" +
> docList.getNumFound());
>
>                 if (cursorMark.equals(nextCursorMark)) {
>                     done = true;
>                     client.commit(true, true);
>                 }
>                 cursorMark = nextCursorMark;
>             }
>
>         } catch (Exception e) {
>             log.error("Error indexing ", e);
>         }
>     }
>
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Sat, Sep 26, 2015 at 12:10 AM, Ravi Solr <ra...@gmail.com> wrote:
>
>> thank you for taking time to help me out. Yes I was not using cursorMark,
>> I will try that next. This is what I was doing, its a bit shabby coding but
>> what can I say my brain was fried :-) FYI this is a side process just to
>> correct a messed up string. The actual indexing process was working all the
>> time as our business owners are a bit petulant about stopping indexing. My
>> autocommit conf and code is given below, as you can see autocommit should
>> fire every 100 docs anyway
>>
>>     <autoCommit>
>>        <maxDocs>100</maxDocs>
>>        <maxTime>120000</maxTime>
>>     </autoCommit>
>>
>>     <autoSoftCommit>
>>         <maxTime>30000</maxTime>
>>     </autoSoftCommit>
>>   </updateHandler>
>>
>>     private static void processDocs() {
>>
>>         try {
>>             CloudSolrClient client = new
>> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>>             client.setDefaultCollection("collection1");
>>
>>             //First initialize docs
>>             SolrDocumentList docList = getDocs(client, 100);
>>             Long count = 0L;
>>
>>             while (docList != null && docList.size() > 0) {
>>
>>                 List<SolrInputDocument> inList = new
>> ArrayList<SolrInputDocument>();
>>                 for(SolrDocument doc : docList) {
>>
>>                     SolrInputDocument iDoc =
>> ClientUtils.toSolrInputDocument(doc);
>>
>>                     //This is my SOLR's Unique id
>>                     String uniqueId = (String)
>> iDoc.getFieldValue("uniqueId");
>>
>>                     /*
>>                      * This is another system's id which is what I want
>> to correct. Was messed
>>                      * because of script transformer in DIH import via
>> SolrEntityProcessor
>>                      * ex-
>> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>>                      */
>>                     String uuid = (String) iDoc.getFieldValue("uuid");
>>                     String sanitizedUUID =
>> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>>                     Map<String,String> fieldModifier = new
>> HashMap<String,String>(1);
>>                     fieldModifier.put("set",sanitizedUUID);
>>                     iDoc.setField("uuid", fieldModifier);
>>
>>                     inList.add(iDoc);
>>                     log.info("added " + systemid);
>>                 }
>>
>>                 client.add(inList);
>>
>>                 count = count + docList.size();
>>                 log.info("Indexed " + count + "/" +
>> docList.getNumFound());
>>
>>                 Thread.sleep(5000);
>>
>>                 docList = getDocs(client, docList.size());
>>                 log.info("Got Docs- " + docList.getNumFound());
>>             }
>>
>>         } catch (Exception e) {
>>             log.error("Error indexing ", e);
>>         }
>>     }
>>
>>     private static SolrDocumentList getDocs(CloudSolrClient client,
>> Integer rows) {
>>
>>
>>         SolrQuery q = new SolrQuery("*:*");
>>         q.setSort("publishtime", ORDER.desc);
>>         q.setStart(0);
>>         q.setRows(rows);
>>         q.addFilterQuery(new String[] {"uuid:[* TO *]",
>> "uuid:sun.org.mozilla*"});
>>         q.setFields(new String[]{"uniqueId","uuid"});
>>         SolrDocumentList docList = null;
>>         QueryResponse resp;
>>         try {
>>             resp = client.query(q);
>>             docList = resp.getResults();
>>         } catch (Exception e) {
>>             log.error("Error querying " + q.toString(), e);
>>         }
>>         return docList;
>>     }
>>
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerickson@gmail.com
>> > wrote:
>>
>>> Wait, query again how? You've got to have something that keeps you
>>> from getting the same 100 docs back so you have to be sorting somehow.
>>> Or you have a high water mark. Or something. Waiting 5 seconds for any
>>> commit also doesn't really make sense to me. I mean how do you know
>>>
>>> 1> that you're going to get a commit (did you explicitly send one from
>>> the client?).
>>> 2> all autowarming will be complete by the time the next query hits?
>>>
>>> Let's see the query you fire. There has to be some kind of marker that
>>> you're using to know when you've gotten through the entire set.
>>>
>>> And I would use much larger batches, I usually update in batches of
>>> 1,000 (excepting if these are very large docs of course). I suspect
>>> you're spending a lot more time sleeping than you need to. I wouldn't
>>> sleep at all in fact. This is one (rare) case I might consider
>>> committing from the client. If you specify the wait for searcher param
>>> (server.commit(true, true), then it doesn't return until a new
>>> searcher is completely opened so your previous updates will be
>>> reflected in your next search.
>>>
>>> Actually, what I'd really do is
>>> 1> turn off all auto commits
>>> 2> go ahead and query/change/update. But the query bits would be using
>>> the cursormark.
>>> 3> do NOT commit
>>> 4> issue a commit when you were all done.
>>>
>>> I bet you'd get through your update a lot faster that way.
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ra...@gmail.com> wrote:
>>> > Thanks for responding Erick. I set the "start" to zero and "rows"
>>> always to
>>> > 100. I create CloudSolrClient instance and use it to both query as
>>> well as
>>> > index. But I do sleep for 5 secs just to allow for any auto commits.
>>> >
>>> > So query --> client.add(100 docs) --> wait --> query again
>>> >
>>> > But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
>>> > docs the "query again" returns zero docs causing my while loop to
>>> > exist...so was trying to see if I was doing the right thing or if
>>> there is
>>> > an alternate way to do heavy indexing.
>>> >
>>> > Thanks
>>> >
>>> > Ravi Kiran Bhaskar
>>> >
>>> >
>>> >
>>> > On Friday, September 25, 2015, Erick Erickson <erickerickson@gmail.com
>>> >
>>> > wrote:
>>> >
>>> >> How are you querying Solr? You say you query for 100 docs,
>>> >> update then get the next set. What are you using for a marker?
>>> >> If you're using the start parameter, and somehow a commit is
>>> >> creeping in things might be weird, especially if you're using any
>>> >> of the internal Lucene doc IDs. If you're absolutely sure no commits
>>> >> are taking place even that should be OK.
>>> >>
>>> >> The "deep paging" stuff could be helpful here, see:
>>> >>
>>> >>
>>> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>> >>
>>> >> Best,
>>> >> Erick
>>> >>
>>> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravisolr@gmail.com
>>> >> <javascript:;>> wrote:
>>> >> > No problem Walter, it's all fun. Was just wondering if there was
>>> some
>>> >> other
>>> >> > good way that I did not know of, that's all 😀
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> > Ravi Kiran Bhaskar
>>> >> >
>>> >> > On Friday, September 25, 2015, Walter Underwood <
>>> wunder@wunderwood.org
>>> >> <javascript:;>>
>>> >> > wrote:
>>> >> >
>>> >> >> Sorry, I did not mean to be rude. The original question did not
>>> say that
>>> >> >> you don’t have the docs outside of Solr. Some people jump to the
>>> >> advanced
>>> >> >> features and miss the simple ones.
>>> >> >>
>>> >> >> It might be faster to fetch all the docs from Solr and save them in
>>> >> files.
>>> >> >> Then modify them. Then reload all of them. No guarantee, but it is
>>> >> worth a
>>> >> >> try.
>>> >> >>
>>> >> >> Good luck.
>>> >> >>
>>> >> >> wunder
>>> >> >> Walter Underwood
>>> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>>> >> >> http://observer.wunderwood.org/  (my blog)
>>> >> >>
>>> >> >>
>>> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
>>> >> <javascript:;>
>>> >> >> <javascript:;>> wrote:
>>> >> >> >
>>> >> >> > Walter, Not in a mood for banter right now.... Its 6:00pm on a
>>> friday
>>> >> and
>>> >> >> > Iam stuck here trying to figure reindexing issues :-)
>>> >> >> > I dont have source of docs so I have to query the SOLR, modify
>>> and
>>> >> put it
>>> >> >> > back and that is seeming to be quite a task in 5.3.0, I did
>>> reindex
>>> >> >> several
>>> >> >> > times with 4.7.2 in a master slave env without any issue. Since
>>> then
>>> >> we
>>> >> >> > have moved to cloud and it has been a pain all day.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > Ravi Kiran Bhaskar
>>> >> >> >
>>> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
>>> >> wunder@wunderwood.org <javascript:;>
>>> >> >> <javascript:;>>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> >> Sure.
>>> >> >> >>
>>> >> >> >> 1. Delete all the docs (no commit).
>>> >> >> >> 2. Add all the docs (no commit).
>>> >> >> >> 3. Commit.
>>> >> >> >>
>>> >> >> >> wunder
>>> >> >> >> Walter Underwood
>>> >> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>>> >> >> >> http://observer.wunderwood.org/  (my blog)
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
>>> >> <javascript:;>
>>> >> >> <javascript:;>> wrote:
>>> >> >> >>>
>>> >> >> >>> I have been trying to re-index the docs (about 1.5 million) as
>>> one
>>> >> of
>>> >> >> the
>>> >> >> >>> field needed part of string value removed (accidentally
>>> >> introduced). I
>>> >> >> >> was
>>> >> >> >>> issuing a query for 100 docs getting 4 fields and updating the
>>> doc
>>> >> >> >> (atomic
>>> >> >> >>> update with "set") via the CloudSolrClient in batches, However
>>> from
>>> >> >> time
>>> >> >> >> to
>>> >> >> >>> time the query returns 0 results, which exits the re-indexing
>>> >> program.
>>> >> >> >>>
>>> >> >> >>> I cant understand as to why the cloud returns 0 results when
>>> there
>>> >> are
>>> >> >> >> 1.4x
>>> >> >> >>> million docs which have the "accidental" string in them.
>>> >> >> >>>
>>> >> >> >>> Is there another way to do bulk massive updates ?
>>> >> >> >>>
>>> >> >> >>> Thanks
>>> >> >> >>>
>>> >> >> >>> Ravi Kiran Bhaskar
>>> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>>
>>
>>
>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Erick, as per your advice I used cursorMarks (see code below). It was
slightly better, but Solr throws exceptions randomly. Please look at the
code and stack trace below.

2015-09-26 01:00:45 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133
2015-09-26 01:00:49 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133
2015-09-26 01:00:54 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592
2015-09-26 01:00:58 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095
2015-09-26 01:01:03 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675
2015-09-26 01:01:10 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924
2015-09-26 01:01:15 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445
2015-09-26 01:01:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997
2015-09-26 01:01:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692
2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201
2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://xx.xx.xx.xx:1111/solr/collection1: missing content
stream
    at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560)
    at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
    at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
    at
org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376)
    at
org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328)
    at
org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
    at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856)
    at
org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799)
    at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
    at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97)
    at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.simontuffs.onejar.Boot.run(Boot.java:306)
    at com.simontuffs.onejar.Boot.main(Boot.java:159)
2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!


CODE
------------
    protected static void processDocs() {

        try {
            CloudSolrClient client = new
CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
            client.setDefaultCollection("collection1");

            boolean done = false;
            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            Integer count = 0;

            while (!done) {
                SolrQuery q = new
SolrQuery("*:*").setRows(500).addSort("publishtime",
ORDER.desc).addSort("uniqueId",ORDER.desc).setFields(new
String[]{"uniqueId","uuid"});
                q.addFilterQuery(new String[] {"uuid:[* TO *]",
"uuid:sun.org.mozilla*"});
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);

                QueryResponse resp = client.query(q);
                String nextCursorMark = resp.getNextCursorMark();

                SolrDocumentList docList = resp.getResults();

                List<SolrInputDocument> inList = new
ArrayList<SolrInputDocument>();
                for(SolrDocument doc : docList) {

                    SolrInputDocument iDoc =
ClientUtils.toSolrInputDocument(doc);

                    //This is my system's id
                    String uniqueId = (String)
iDoc.getFieldValue("uniqueId");

                    /*
                     * This is another system's unique id which is what I
want to correct that was messed
                     * because of script transformer in DIH import via
SolrEntityProcessor
                     * ex-
sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
                     */
                    String uuid = (String) iDoc.getFieldValue("uuid");
                    String sanitizedUUID =
uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
                    Map<String,String> fieldModifier = new
HashMap<String,String>(1);
                    fieldModifier.put("set",sanitizedUUID);
                    iDoc.setField("uuid", fieldModifier);

                    inList.add(iDoc);
                }
                client.add(inList);

                count = count + docList.size();
                log.info("Indexed " + count + "/" +
docList.getNumFound());

                if (cursorMark.equals(nextCursorMark)) {
                    done = true;
                    client.commit(true, true);
                }
                cursorMark = nextCursorMark;
            }

        } catch (Exception e) {
            log.error("Error indexing ", e);
        }
    }


Thanks

Ravi Kiran Bhaskar

On Sat, Sep 26, 2015 at 12:10 AM, Ravi Solr <ra...@gmail.com> wrote:

> thank you for taking time to help me out. Yes I was not using cursorMark,
> I will try that next. This is what I was doing, its a bit shabby coding but
> what can I say my brain was fried :-) FYI this is a side process just to
> correct a messed up string. The actual indexing process was working all the
> time as our business owners are a bit petulant about stopping indexing. My
> autocommit conf and code is given below, as you can see autocommit should
> fire every 100 docs anyway
>
>     <autoCommit>
>        <maxDocs>100</maxDocs>
>        <maxTime>120000</maxTime>
>     </autoCommit>
>
>     <autoSoftCommit>
>         <maxTime>30000</maxTime>
>     </autoSoftCommit>
>   </updateHandler>
>
>     private static void processDocs() {
>
>         try {
>             CloudSolrClient client = new
> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>             client.setDefaultCollection("collection1");
>
>             //First initialize docs
>             SolrDocumentList docList = getDocs(client, 100);
>             Long count = 0L;
>
>             while (docList != null && docList.size() > 0) {
>
>                 List<SolrInputDocument> inList = new
> ArrayList<SolrInputDocument>();
>                 for(SolrDocument doc : docList) {
>
>                     SolrInputDocument iDoc =
> ClientUtils.toSolrInputDocument(doc);
>
>                     //This is my SOLR's Unique id
>                     String uniqueId = (String)
> iDoc.getFieldValue("uniqueId");
>
>                     /*
>                      * This is another system's id which is what I want to
> correct. Was messed
>                      * because of script transformer in DIH import via
> SolrEntityProcessor
>                      * ex-
> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>                      */
>                     String uuid = (String) iDoc.getFieldValue("uuid");
>                     String sanitizedUUID =
> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>                     Map<String,String> fieldModifier = new
> HashMap<String,String>(1);
>                     fieldModifier.put("set",sanitizedUUID);
>                     iDoc.setField("uuid", fieldModifier);
>
>                     inList.add(iDoc);
>                     log.info("added " + systemid);
>                 }
>
>                 client.add(inList);
>
>                 count = count + docList.size();
>                 log.info("Indexed " + count + "/" +
> docList.getNumFound());
>
>                 Thread.sleep(5000);
>
>                 docList = getDocs(client, docList.size());
>                 log.info("Got Docs- " + docList.getNumFound());
>             }
>
>         } catch (Exception e) {
>             log.error("Error indexing ", e);
>         }
>     }
>
>     private static SolrDocumentList getDocs(CloudSolrClient client,
> Integer rows) {
>
>
>         SolrQuery q = new SolrQuery("*:*");
>         q.setSort("publishtime", ORDER.desc);
>         q.setStart(0);
>         q.setRows(rows);
>         q.addFilterQuery(new String[] {"uuid:[* TO *]",
> "uuid:sun.org.mozilla*"});
>         q.setFields(new String[]{"uniqueId","uuid"});
>         SolrDocumentList docList = null;
>         QueryResponse resp;
>         try {
>             resp = client.query(q);
>             docList = resp.getResults();
>         } catch (Exception e) {
>             log.error("Error querying " + q.toString(), e);
>         }
>         return docList;
>     }
>
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Wait, query again how? You've got to have something that keeps you
>> from getting the same 100 docs back so you have to be sorting somehow.
>> Or you have a high water mark. Or something. Waiting 5 seconds for any
>> commit also doesn't really make sense to me. I mean how do you know
>>
>> 1> that you're going to get a commit (did you explicitly send one from
>> the client?).
>> 2> all autowarming will be complete by the time the next query hits?
>>
>> Let's see the query you fire. There has to be some kind of marker that
>> you're using to know when you've gotten through the entire set.
>>
>> And I would use much larger batches, I usually update in batches of
>> 1,000 (excepting if these are very large docs of course). I suspect
>> you're spending a lot more time sleeping than you need to. I wouldn't
>> sleep at all in fact. This is one (rare) case I might consider
>> committing from the client. If you specify the wait for searcher param
>> (server.commit(true, true), then it doesn't return until a new
>> searcher is completely opened so your previous updates will be
>> reflected in your next search.
>>
>> Actually, what I'd really do is
>> 1> turn off all auto commits
>> 2> go ahead and query/change/update. But the query bits would be using
>> the cursormark.
>> 3> do NOT commit
>> 4> issue a commit when you were all done.
>>
>> I bet you'd get through your update a lot faster that way.
>>
>> Best,
>> Erick
>>
>> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ra...@gmail.com> wrote:
>> > Thanks for responding Erick. I set the "start" to zero and "rows"
>> always to
>> > 100. I create CloudSolrClient instance and use it to both query as well
>> as
>> > index. But I do sleep for 5 secs just to allow for any auto commits.
>> >
>> > So query --> client.add(100 docs) --> wait --> query again
>> >
>> > But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
>> > docs the "query again" returns zero docs causing my while loop to
>> > exist...so was trying to see if I was doing the right thing or if there
>> is
>> > an alternate way to do heavy indexing.
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>> >
>> >
>> > On Friday, September 25, 2015, Erick Erickson <er...@gmail.com>
>> > wrote:
>> >
>> >> How are you querying Solr? You say you query for 100 docs,
>> >> update then get the next set. What are you using for a marker?
>> >> If you're using the start parameter, and somehow a commit is
>> >> creeping in things might be weird, especially if you're using any
>> >> of the internal Lucene doc IDs. If you're absolutely sure no commits
>> >> are taking place even that should be OK.
>> >>
>> >> The "deep paging" stuff could be helpful here, see:
>> >>
>> >>
>> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravisolr@gmail.com
>> >> <javascript:;>> wrote:
>> >> > No problem Walter, it's all fun. Was just wondering if there was some
>> >> other
>> >> > good way that I did not know of, that's all 😀
>> >> >
>> >> > Thanks
>> >> >
>> >> > Ravi Kiran Bhaskar
>> >> >
>> >> > On Friday, September 25, 2015, Walter Underwood <
>> wunder@wunderwood.org
>> >> <javascript:;>>
>> >> > wrote:
>> >> >
>> >> >> Sorry, I did not mean to be rude. The original question did not say
>> that
>> >> >> you don’t have the docs outside of Solr. Some people jump to the
>> >> advanced
>> >> >> features and miss the simple ones.
>> >> >>
>> >> >> It might be faster to fetch all the docs from Solr and save them in
>> >> files.
>> >> >> Then modify them. Then reload all of them. No guarantee, but it is
>> >> worth a
>> >> >> try.
>> >> >>
>> >> >> Good luck.
>> >> >>
>> >> >> wunder
>> >> >> Walter Underwood
>> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>> >> >> http://observer.wunderwood.org/  (my blog)
>> >> >>
>> >> >>
>> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
>> >> <javascript:;>
>> >> >> <javascript:;>> wrote:
>> >> >> >
>> >> >> > Walter, Not in a mood for banter right now.... Its 6:00pm on a
>> friday
>> >> and
>> >> >> > Iam stuck here trying to figure reindexing issues :-)
>> >> >> > I dont have source of docs so I have to query the SOLR, modify and
>> >> put it
>> >> >> > back and that is seeming to be quite a task in 5.3.0, I did
>> reindex
>> >> >> several
>> >> >> > times with 4.7.2 in a master slave env without any issue. Since
>> then
>> >> we
>> >> >> > have moved to cloud and it has been a pain all day.
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >> > Ravi Kiran Bhaskar
>> >> >> >
>> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
>> >> wunder@wunderwood.org <javascript:;>
>> >> >> <javascript:;>>
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Sure.
>> >> >> >>
>> >> >> >> 1. Delete all the docs (no commit).
>> >> >> >> 2. Add all the docs (no commit).
>> >> >> >> 3. Commit.
>> >> >> >>
>> >> >> >> wunder
>> >> >> >> Walter Underwood
>> >> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>> >> >> >> http://observer.wunderwood.org/  (my blog)
>> >> >> >>
>> >> >> >>
>> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
>> >> <javascript:;>
>> >> >> <javascript:;>> wrote:
>> >> >> >>>
>> >> >> >>> I have been trying to re-index the docs (about 1.5 million) as
>> one
>> >> of
>> >> >> the
>> >> >> >>> field needed part of string value removed (accidentally
>> >> introduced). I
>> >> >> >> was
>> >> >> >>> issuing a query for 100 docs getting 4 fields and updating the
>> doc
>> >> >> >> (atomic
>> >> >> >>> update with "set") via the CloudSolrClient in batches, However
>> from
>> >> >> time
>> >> >> >> to
>> >> >> >>> time the query returns 0 results, which exits the re-indexing
>> >> program.
>> >> >> >>>
>> >> >> >>> I cant understand as to why the cloud returns 0 results when
>> there
>> >> are
>> >> >> >> 1.4x
>> >> >> >>> million docs which have the "accidental" string in them.
>> >> >> >>>
>> >> >> >>> Is there another way to do bulk massive updates ?
>> >> >> >>>
>> >> >> >>> Thanks
>> >> >> >>>
>> >> >> >>> Ravi Kiran Bhaskar
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>>
>
>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Thank you for taking the time to help me out. Yes, I was not using cursorMark;
I will try that next. This is what I was doing; it's a bit shabby coding, but
what can I say, my brain was fried :-) FYI this is a side process just to
correct a messed-up string. The actual indexing process was working all the
time, as our business owners are a bit petulant about stopping indexing. My
autocommit conf and code are given below; as you can see, autocommit should
fire every 100 docs anyway.

    <autoCommit>
       <maxDocs>100</maxDocs>
       <maxTime>120000</maxTime>
    </autoCommit>

    <autoSoftCommit>
        <maxTime>30000</maxTime>
    </autoSoftCommit>
  </updateHandler>

    private static void processDocs() {

        try {
            CloudSolrClient client = new
CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
            client.setDefaultCollection("collection1");

            //First initialize docs
            SolrDocumentList docList = getDocs(client, 100);
            Long count = 0L;

            while (docList != null && docList.size() > 0) {

                List<SolrInputDocument> inList = new
ArrayList<SolrInputDocument>();
                for(SolrDocument doc : docList) {

                    SolrInputDocument iDoc =
ClientUtils.toSolrInputDocument(doc);

                    //This is my SOLR's Unique id
                    String uniqueId = (String)
iDoc.getFieldValue("uniqueId");

                    /*
                     * This is another system's id which is what I want to
correct. Was messed
                     * because of script transformer in DIH import via
SolrEntityProcessor
                     * ex-
sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
                     */
                    String uuid = (String) iDoc.getFieldValue("uuid");
                    String sanitizedUUID =
uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
                    Map<String,String> fieldModifier = new
HashMap<String,String>(1);
                    fieldModifier.put("set",sanitizedUUID);
                    iDoc.setField("uuid", fieldModifier);

                    inList.add(iDoc);
                    log.info("added " + systemid);
                }

                client.add(inList);

                count = count + docList.size();
                log.info("Indexed " + count + "/" + docList.getNumFound());

                Thread.sleep(5000);

                docList = getDocs(client, docList.size());
                log.info("Got Docs- " + docList.getNumFound());
            }

        } catch (Exception e) {
            log.error("Error indexing ", e);
        }
    }

    private static SolrDocumentList getDocs(CloudSolrClient client, Integer
rows) {


        SolrQuery q = new SolrQuery("*:*");
        q.setSort("publishtime", ORDER.desc);
        q.setStart(0);
        q.setRows(rows);
        q.addFilterQuery(new String[] {"uuid:[* TO *]",
"uuid:sun.org.mozilla*"});
        q.setFields(new String[]{"uniqueId","uuid"});
        SolrDocumentList docList = null;
        QueryResponse resp;
        try {
            resp = client.query(q);
            docList = resp.getResults();
        } catch (Exception e) {
            log.error("Error querying " + q.toString(), e);
        }
        return docList;
    }


Thanks

Ravi Kiran Bhaskar

On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <er...@gmail.com>
wrote:

> Wait, query again how? You've got to have something that keeps you
> from getting the same 100 docs back so you have to be sorting somehow.
> Or you have a high water mark. Or something. Waiting 5 seconds for any
> commit also doesn't really make sense to me. I mean how do you know
>
> 1> that you're going to get a commit (did you explicitly send one from
> the client?).
> 2> all autowarming will be complete by the time the next query hits?
>
> Let's see the query you fire. There has to be some kind of marker that
> you're using to know when you've gotten through the entire set.
>
> And I would use much larger batches, I usually update in batches of
> 1,000 (excepting if these are very large docs of course). I suspect
> you're spending a lot more time sleeping than you need to. I wouldn't
> sleep at all in fact. This is one (rare) case I might consider
> committing from the client. If you specify the wait for searcher param
> (server.commit(true, true), then it doesn't return until a new
> searcher is completely opened so your previous updates will be
> reflected in your next search.
>
> Actually, what I'd really do is
> 1> turn off all auto commits
> 2> go ahead and query/change/update. But the query bits would be using
> the cursormark.
> 3> do NOT commit
> 4> issue a commit when you were all done.
>
> I bet you'd get through your update a lot faster that way.
>
> Best,
> Erick
>
> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ra...@gmail.com> wrote:
> > Thanks for responding Erick. I set the "start" to zero and "rows" always
> to
> > 100. I create CloudSolrClient instance and use it to both query as well
> as
> > index. But I do sleep for 5 secs just to allow for any auto commits.
> >
> > So query --> client.add(100 docs) --> wait --> query again
> >
> > But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
> > docs the "query again" returns zero docs causing my while loop to
> > exist...so was trying to see if I was doing the right thing or if there
> is
> > an alternate way to do heavy indexing.
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> >
> >
> > On Friday, September 25, 2015, Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> How are you querying Solr? You say you query for 100 docs,
> >> update then get the next set. What are you using for a marker?
> >> If you're using the start parameter, and somehow a commit is
> >> creeping in things might be weird, especially if you're using any
> >> of the internal Lucene doc IDs. If you're absolutely sure no commits
> >> are taking place even that should be OK.
> >>
> >> The "deep paging" stuff could be helpful here, see:
> >>
> >>
> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravisolr@gmail.com
> >> <javascript:;>> wrote:
> >> > No problem Walter, it's all fun. Was just wondering if there was some
> >> other
> >> > good way that I did not know of, that's all 😀
> >> >
> >> > Thanks
> >> >
> >> > Ravi Kiran Bhaskar
> >> >
> >> > On Friday, September 25, 2015, Walter Underwood <
> wunder@wunderwood.org
> >> <javascript:;>>
> >> > wrote:
> >> >
> >> >> Sorry, I did not mean to be rude. The original question did not say
> that
> >> >> you don’t have the docs outside of Solr. Some people jump to the
> >> advanced
> >> >> features and miss the simple ones.
> >> >>
> >> >> It might be faster to fetch all the docs from Solr and save them in
> >> files.
> >> >> Then modify them. Then reload all of them. No guarantee, but it is
> >> worth a
> >> >> try.
> >> >>
> >> >> Good luck.
> >> >>
> >> >> wunder
> >> >> Walter Underwood
> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
> >> >> http://observer.wunderwood.org/  (my blog)
> >> >>
> >> >>
> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
> >> <javascript:;>
> >> >> <javascript:;>> wrote:
> >> >> >
> >> >> > Walter, Not in a mood for banter right now.... Its 6:00pm on a
> friday
> >> and
> >> >> > Iam stuck here trying to figure reindexing issues :-)
> >> >> > I dont have source of docs so I have to query the SOLR, modify and
> >> put it
> >> >> > back and that is seeming to be quite a task in 5.3.0, I did reindex
> >> >> several
> >> >> > times with 4.7.2 in a master slave env without any issue. Since
> then
> >> we
> >> >> > have moved to cloud and it has been a pain all day.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > Ravi Kiran Bhaskar
> >> >> >
> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
> >> wunder@wunderwood.org <javascript:;>
> >> >> <javascript:;>>
> >> >> > wrote:
> >> >> >
> >> >> >> Sure.
> >> >> >>
> >> >> >> 1. Delete all the docs (no commit).
> >> >> >> 2. Add all the docs (no commit).
> >> >> >> 3. Commit.
> >> >> >>
> >> >> >> wunder
> >> >> >> Walter Underwood
> >> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
> >> >> >> http://observer.wunderwood.org/  (my blog)
> >> >> >>
> >> >> >>
> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
> >> <javascript:;>
> >> >> <javascript:;>> wrote:
> >> >> >>>
> >> >> >>> I have been trying to re-index the docs (about 1.5 million) as
> one
> >> of
> >> >> the
> >> >> >>> field needed part of string value removed (accidentally
> >> introduced). I
> >> >> >> was
> >> >> >>> issuing a query for 100 docs getting 4 fields and updating the
> doc
> >> >> >> (atomic
> >> >> >>> update with "set") via the CloudSolrClient in batches, However
> from
> >> >> time
> >> >> >> to
> >> >> >>> time the query returns 0 results, which exits the re-indexing
> >> program.
> >> >> >>>
> >> >> >>> I cant understand as to why the cloud returns 0 results when
> there
> >> are
> >> >> >> 1.4x
> >> >> >>> million docs which have the "accidental" string in them.
> >> >> >>>
> >> >> >>> Is there another way to do bulk massive updates ?
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>>
> >> >> >>> Ravi Kiran Bhaskar
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
>

Re: bulk reindexing 5.3.0 issue

Posted by Erick Erickson <er...@gmail.com>.
Wait, query again how? You've got to have something that keeps you
from getting the same 100 docs back so you have to be sorting somehow.
Or you have a high water mark. Or something. Waiting 5 seconds for any
commit also doesn't really make sense to me. I mean how do you know

1> that you're going to get a commit (did you explicitly send one from
the client?).
2> all autowarming will be complete by the time the next query hits?

Let's see the query you fire. There has to be some kind of marker that
you're using to know when you've gotten through the entire set.

And I would use much larger batches, I usually update in batches of
1,000 (excepting if these are very large docs of course). I suspect
you're spending a lot more time sleeping than you need to. I wouldn't
sleep at all in fact. This is one (rare) case I might consider
committing from the client. If you specify the wait for searcher param
(server.commit(true, true), then it doesn't return until a new
searcher is completely opened so your previous updates will be
reflected in your next search.

Actually, what I'd really do is
1> turn off all auto commits
2> go ahead and query/change/update. But the query bits would be using
the cursormark.
3> do NOT commit
4> issue a commit when you were all done.

I bet you'd get through your update a lot faster that way.

Best,
Erick
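
A rough sketch of that sequence (assuming SolrJ 5.x, autocommit disabled in solrconfig.xml, batches of 1,000, and uniqueId as the collection's uniqueKey) might be:

    // Sketch only: cursorMark traversal, no intermediate commits, one commit at the end.
    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
        SolrQuery q = new SolrQuery("uuid:sun.org.mozilla*");
        q.setFields("uniqueId", "uuid");
        q.setRows(1000);
        q.setSort(SolrQuery.SortClause.asc("uniqueId"));  // cursor needs a stable uniqueKey sort
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);

        QueryResponse resp = client.query(q);
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrDocument doc : resp.getResults()) {
            SolrInputDocument iDoc = new SolrInputDocument();
            iDoc.addField("uniqueId", doc.getFieldValue("uniqueId"));
            String fixed = ((String) doc.getFieldValue("uuid"))
                    .replace("sun.org.mozilla.javascript.internal.NativeString:", "");
            iDoc.addField("uuid", Collections.singletonMap("set", fixed)); // atomic update
            batch.add(iDoc);
        }
        if (!batch.isEmpty()) {
            client.add(batch);                            // no commit here
        }
        String next = resp.getNextCursorMark();
        if (cursor.equals(next)) {
            break;                                        // cursor did not advance: done
        }
        cursor = next;
    }
    client.commit(true, true);                            // single commit; waits for the new searcher

Because nothing is committed until the end, the result set does not shift underneath the cursor while the updates are in flight.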

On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ra...@gmail.com> wrote:
> Thanks for responding Erick. I set the "start" to zero and "rows" always to
> 100. I create CloudSolrClient instance and use it to both query as well as
> index. But I do sleep for 5 secs just to allow for any auto commits.
>
> So query --> client.add(100 docs) --> wait --> query again
>
> But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
> docs the "query again" returns zero docs causing my while loop to
> exist...so was trying to see if I was doing the right thing or if there is
> an alternate way to do heavy indexing.
>
> Thanks
>
> Ravi Kiran Bhaskar
>
>
>
> On Friday, September 25, 2015, Erick Erickson <er...@gmail.com>
> wrote:
>
>> How are you querying Solr? You say you query for 100 docs,
>> update then get the next set. What are you using for a marker?
>> If you're using the start parameter, and somehow a commit is
>> creeping in things might be weird, especially if you're using any
>> of the internal Lucene doc IDs. If you're absolutely sure no commits
>> are taking place even that should be OK.
>>
>> The "deep paging" stuff could be helpful here, see:
>>
>> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>
>> Best,
>> Erick
>>
>> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravisolr@gmail.com
>> <javascript:;>> wrote:
>> > No problem Walter, it's all fun. Was just wondering if there was some
>> other
>> > good way that I did not know of, that's all 😀
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>> > On Friday, September 25, 2015, Walter Underwood <wunder@wunderwood.org
>> <javascript:;>>
>> > wrote:
>> >
>> >> Sorry, I did not mean to be rude. The original question did not say that
>> >> you don’t have the docs outside of Solr. Some people jump to the
>> advanced
>> >> features and miss the simple ones.
>> >>
>> >> It might be faster to fetch all the docs from Solr and save them in
>> files.
>> >> Then modify them. Then reload all of them. No guarantee, but it is
>> worth a
>> >> try.
>> >>
>> >> Good luck.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>
>> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
>> <javascript:;>
>> >> <javascript:;>> wrote:
>> >> >
>> >> > Walter, Not in a mood for banter right now.... Its 6:00pm on a friday
>> and
>> >> > Iam stuck here trying to figure reindexing issues :-)
>> >> > I dont have source of docs so I have to query the SOLR, modify and
>> put it
>> >> > back and that is seeming to be quite a task in 5.3.0, I did reindex
>> >> several
>> >> > times with 4.7.2 in a master slave env without any issue. Since then
>> we
>> >> > have moved to cloud and it has been a pain all day.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Ravi Kiran Bhaskar
>> >> >
>> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
>> wunder@wunderwood.org <javascript:;>
>> >> <javascript:;>>
>> >> > wrote:
>> >> >
>> >> >> Sure.
>> >> >>
>> >> >> 1. Delete all the docs (no commit).
>> >> >> 2. Add all the docs (no commit).
>> >> >> 3. Commit.
>> >> >>
>> >> >> wunder
>> >> >> Walter Underwood
>> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
>> >> >> http://observer.wunderwood.org/  (my blog)
>> >> >>
>> >> >>
>> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
>> <javascript:;>
>> >> <javascript:;>> wrote:
>> >> >>>
>> >> >>> I have been trying to re-index the docs (about 1.5 million) as one
>> of
>> >> the
>> >> >>> field needed part of string value removed (accidentally
>> introduced). I
>> >> >> was
>> >> >>> issuing a query for 100 docs getting 4 fields and updating the doc
>> >> >> (atomic
>> >> >>> update with "set") via the CloudSolrClient in batches, However from
>> >> time
>> >> >> to
>> >> >>> time the query returns 0 results, which exits the re-indexing
>> program.
>> >> >>>
>> >> >>> I cant understand as to why the cloud returns 0 results when there
>> are
>> >> >> 1.4x
>> >> >>> million docs which have the "accidental" string in them.
>> >> >>>
>> >> >>> Is there another way to do bulk massive updates ?
>> >> >>>
>> >> >>> Thanks
>> >> >>>
>> >> >>> Ravi Kiran Bhaskar
>> >> >>
>> >> >>
>> >>
>> >>
>>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Thanks for responding Erick. I set the "start" to zero and "rows" always to
100. I create CloudSolrClient instance and use it to both query as well as
index. But I do sleep for 5 secs just to allow for any auto commits.

So query --> client.add(100 docs) --> wait --> query again

But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
docs the "query again" returns zero docs causing my while loop to
exist...so was trying to see if I was doing the right thing or if there is
an alternate way to do heavy indexing.

Thanks

Ravi Kiran Bhaskar



On Friday, September 25, 2015, Erick Erickson <er...@gmail.com>
wrote:

> How are you querying Solr? You say you query for 100 docs,
> update then get the next set. What are you using for a marker?
> If you're using the start parameter, and somehow a commit is
> creeping in things might be weird, especially if you're using any
> of the internal Lucene doc IDs. If you're absolutely sure no commits
> are taking place even that should be OK.
>
> The "deep paging" stuff could be helpful here, see:
>
> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> Best,
> Erick
>
> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravisolr@gmail.com
> <javascript:;>> wrote:
> > No problem Walter, it's all fun. Was just wondering if there was some
> other
> > good way that I did not know of, that's all 😀
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Friday, September 25, 2015, Walter Underwood <wunder@wunderwood.org
> <javascript:;>>
> > wrote:
> >
> >> Sorry, I did not mean to be rude. The original question did not say that
> >> you don’t have the docs outside of Solr. Some people jump to the
> advanced
> >> features and miss the simple ones.
> >>
> >> It might be faster to fetch all the docs from Solr and save them in
> files.
> >> Then modify them. Then reload all of them. No guarantee, but it is
> worth a
> >> try.
> >>
> >> Good luck.
> >>
> >> wunder
> >> Walter Underwood
> >> wunder@wunderwood.org <javascript:;> <javascript:;>
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
> <javascript:;>
> >> <javascript:;>> wrote:
> >> >
> >> > Walter, Not in a mood for banter right now.... Its 6:00pm on a friday
> and
> >> > Iam stuck here trying to figure reindexing issues :-)
> >> > I dont have source of docs so I have to query the SOLR, modify and
> put it
> >> > back and that is seeming to be quite a task in 5.3.0, I did reindex
> >> several
> >> > times with 4.7.2 in a master slave env without any issue. Since then
> we
> >> > have moved to cloud and it has been a pain all day.
> >> >
> >> > Thanks
> >> >
> >> > Ravi Kiran Bhaskar
> >> >
> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
> wunder@wunderwood.org <javascript:;>
> >> <javascript:;>>
> >> > wrote:
> >> >
> >> >> Sure.
> >> >>
> >> >> 1. Delete all the docs (no commit).
> >> >> 2. Add all the docs (no commit).
> >> >> 3. Commit.
> >> >>
> >> >> wunder
> >> >> Walter Underwood
> >> >> wunder@wunderwood.org <javascript:;> <javascript:;>
> >> >> http://observer.wunderwood.org/  (my blog)
> >> >>
> >> >>
> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
> <javascript:;>
> >> <javascript:;>> wrote:
> >> >>>
> >> >>> I have been trying to re-index the docs (about 1.5 million) as one
> of
> >> the
> >> >>> field needed part of string value removed (accidentally
> introduced). I
> >> >> was
> >> >>> issuing a query for 100 docs getting 4 fields and updating the doc
> >> >> (atomic
> >> >>> update with "set") via the CloudSolrClient in batches, However from
> >> time
> >> >> to
> >> >>> time the query returns 0 results, which exits the re-indexing
> program.
> >> >>>
> >> >>> I cant understand as to why the cloud returns 0 results when there
> are
> >> >> 1.4x
> >> >>> million docs which have the "accidental" string in them.
> >> >>>
> >> >>> Is there another way to do bulk massive updates ?
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> Ravi Kiran Bhaskar
> >> >>
> >> >>
> >>
> >>
>

Re: bulk reindexing 5.3.0 issue

Posted by Erick Erickson <er...@gmail.com>.
How are you querying Solr? You say you query for 100 docs,
update then get the next set. What are you using for a marker?
If you're using the start parameter, and somehow a commit is
creeping in things might be weird, especially if you're using any
of the internal Lucene doc IDs. If you're absolutely sure no commits
are taking place even that should be OK.

The "deep paging" stuff could be helpful here, see:
https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Best,
Erick

On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ra...@gmail.com> wrote:
> No problem Walter, it's all fun. Was just wondering if there was some other
> good way that I did not know of, that's all 😀
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Friday, September 25, 2015, Walter Underwood <wu...@wunderwood.org>
> wrote:
>
>> Sorry, I did not mean to be rude. The original question did not say that
>> you don’t have the docs outside of Solr. Some people jump to the advanced
>> features and miss the simple ones.
>>
>> It might be faster to fetch all the docs from Solr and save them in files.
>> Then modify them. Then reload all of them. No guarantee, but it is worth a
>> try.
>>
>> Good luck.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org <javascript:;>
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
>> <javascript:;>> wrote:
>> >
>> > Walter, Not in a mood for banter right now.... Its 6:00pm on a friday and
>> > Iam stuck here trying to figure reindexing issues :-)
>> > I dont have source of docs so I have to query the SOLR, modify and put it
>> > back and that is seeming to be quite a task in 5.3.0, I did reindex
>> several
>> > times with 4.7.2 in a master slave env without any issue. Since then we
>> > have moved to cloud and it has been a pain all day.
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wunder@wunderwood.org
>> <javascript:;>>
>> > wrote:
>> >
>> >> Sure.
>> >>
>> >> 1. Delete all the docs (no commit).
>> >> 2. Add all the docs (no commit).
>> >> 3. Commit.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wunder@wunderwood.org <javascript:;>
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>
>> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
>> <javascript:;>> wrote:
>> >>>
>> >>> I have been trying to re-index the docs (about 1.5 million) as one of
>> the
>> >>> field needed part of string value removed (accidentally introduced). I
>> >> was
>> >>> issuing a query for 100 docs getting 4 fields and updating the doc
>> >> (atomic
>> >>> update with "set") via the CloudSolrClient in batches, However from
>> time
>> >> to
>> >>> time the query returns 0 results, which exits the re-indexing program.
>> >>>
>> >>> I cant understand as to why the cloud returns 0 results when there are
>> >> 1.4x
>> >>> million docs which have the "accidental" string in them.
>> >>>
>> >>> Is there another way to do bulk massive updates ?
>> >>>
>> >>> Thanks
>> >>>
>> >>> Ravi Kiran Bhaskar
>> >>
>> >>
>>
>>

Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
No problem Walter, it's all fun. Was just wondering if there was some other
good way that I did not know of, that's all 😀

Thanks

Ravi Kiran Bhaskar

On Friday, September 25, 2015, Walter Underwood <wu...@wunderwood.org>
wrote:

> Sorry, I did not mean to be rude. The original question did not say that
> you don’t have the docs outside of Solr. Some people jump to the advanced
> features and miss the simple ones.
>
> It might be faster to fetch all the docs from Solr and save them in files.
> Then modify them. Then reload all of them. No guarantee, but it is worth a
> try.
>
> Good luck.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org <javascript:;>
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravisolr@gmail.com
> <javascript:;>> wrote:
> >
> > Walter, Not in a mood for banter right now.... Its 6:00pm on a friday and
> > Iam stuck here trying to figure reindexing issues :-)
> > I dont have source of docs so I have to query the SOLR, modify and put it
> > back and that is seeming to be quite a task in 5.3.0, I did reindex
> several
> > times with 4.7.2 in a master slave env without any issue. Since then we
> > have moved to cloud and it has been a pain all day.
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wunder@wunderwood.org
> <javascript:;>>
> > wrote:
> >
> >> Sure.
> >>
> >> 1. Delete all the docs (no commit).
> >> 2. Add all the docs (no commit).
> >> 3. Commit.
> >>
> >> wunder
> >> Walter Underwood
> >> wunder@wunderwood.org <javascript:;>
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravisolr@gmail.com
> <javascript:;>> wrote:
> >>>
> >>> I have been trying to re-index the docs (about 1.5 million) as one of
> the
> >>> field needed part of string value removed (accidentally introduced). I
> >> was
> >>> issuing a query for 100 docs getting 4 fields and updating the doc
> >> (atomic
> >>> update with "set") via the CloudSolrClient in batches, However from
> time
> >> to
> >>> time the query returns 0 results, which exits the re-indexing program.
> >>>
> >>> I cant understand as to why the cloud returns 0 results when there are
> >> 1.4x
> >>> million docs which have the "accidental" string in them.
> >>>
> >>> Is there another way to do bulk massive updates ?
> >>>
> >>> Thanks
> >>>
> >>> Ravi Kiran Bhaskar
> >>
> >>
>
>

Re: bulk reindexing 5.3.0 issue

Posted by Walter Underwood <wu...@wunderwood.org>.
Sorry, I did not mean to be rude. The original question did not say that you don’t have the docs outside of Solr. Some people jump to the advanced features and miss the simple ones.

It might be faster to fetch all the docs from Solr and save them in files. Then modify them. Then reload all of them. No guarantee, but it is worth a try.

Good luck.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 25, 2015, at 2:59 PM, Ravi Solr <ra...@gmail.com> wrote:
> 
> Walter, Not in a mood for banter right now.... Its 6:00pm on a friday and
> Iam stuck here trying to figure reindexing issues :-)
> I dont have source of docs so I have to query the SOLR, modify and put it
> back and that is seeming to be quite a task in 5.3.0, I did reindex several
> times with 4.7.2 in a master slave env without any issue. Since then we
> have moved to cloud and it has been a pain all day.
> 
> Thanks
> 
> Ravi Kiran Bhaskar
> 
> On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wu...@wunderwood.org>
> wrote:
> 
>> Sure.
>> 
>> 1. Delete all the docs (no commit).
>> 2. Add all the docs (no commit).
>> 3. Commit.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ra...@gmail.com> wrote:
>>> 
>>> I have been trying to re-index the docs (about 1.5 million) as one of the
>>> field needed part of string value removed (accidentally introduced). I
>> was
>>> issuing a query for 100 docs getting 4 fields and updating the doc
>> (atomic
>>> update with "set") via the CloudSolrClient in batches, However from time
>> to
>>> time the query returns 0 results, which exits the re-indexing program.
>>> 
>>> I cant understand as to why the cloud returns 0 results when there are
>> 1.4x
>>> million docs which have the "accidental" string in them.
>>> 
>>> Is there another way to do bulk massive updates ?
>>> 
>>> Thanks
>>> 
>>> Ravi Kiran Bhaskar
>> 
>> 


Re: bulk reindexing 5.3.0 issue

Posted by Ravi Solr <ra...@gmail.com>.
Walter, Not in a mood for banter right now.... Its 6:00pm on a friday and
Iam stuck here trying to figure reindexing issues :-)
I dont have source of docs so I have to query the SOLR, modify and put it
back and that is seeming to be quite a task in 5.3.0, I did reindex several
times with 4.7.2 in a master slave env without any issue. Since then we
have moved to cloud and it has been a pain all day.

Thanks

Ravi Kiran Bhaskar

On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> Sure.
>
> 1. Delete all the docs (no commit).
> 2. Add all the docs (no commit).
> 3. Commit.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Sep 25, 2015, at 2:17 PM, Ravi Solr <ra...@gmail.com> wrote:
> >
> > I have been trying to re-index the docs (about 1.5 million) as one of the
> > field needed part of string value removed (accidentally introduced). I
> was
> > issuing a query for 100 docs getting 4 fields and updating the doc
> (atomic
> > update with "set") via the CloudSolrClient in batches, However from time
> to
> > time the query returns 0 results, which exits the re-indexing program.
> >
> > I cant understand as to why the cloud returns 0 results when there are
> 1.4x
> > million docs which have the "accidental" string in them.
> >
> > Is there another way to do bulk massive updates ?
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
>
>

Re: bulk reindexing 5.3.0 issue

Posted by Walter Underwood <wu...@wunderwood.org>.
Sure.

1. Delete all the docs (no commit).
2. Add all the docs (no commit).
3. Commit.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
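
In SolrJ, that three-step sequence might look roughly like the sketch below, where "client" stands for a SolrClient pointed at the collection and "batches" for however the externally rebuilt documents are grouped:

    // Sketch only: full rebuild with a single commit at the end.
    client.deleteByQuery("*:*");                    // 1. delete all docs, no commit
    for (List<SolrInputDocument> batch : batches) { // 2. re-add everything, no commit
        client.add(batch);
    }
    client.commit();                                // 3. one commit at the end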


> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ra...@gmail.com> wrote:
> 
> I have been trying to re-index the docs (about 1.5 million) as one of the
> field needed part of string value removed (accidentally introduced). I was
> issuing a query for 100 docs getting 4 fields and updating the doc  (atomic
> update with "set") via the CloudSolrClient in batches, However from time to
> time the query returns 0 results, which exits the re-indexing program.
> 
> I cant understand as to why the cloud returns 0 results when there are 1.4x
> million docs which have the "accidental" string in them.
> 
> Is there another way to do bulk massive updates ?
> 
> Thanks
> 
> Ravi Kiran Bhaskar