Posted to solr-user@lucene.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2013/11/20 13:33:42 UTC

Auto optimized of Solr indexing results

Hi,

After successfully configuring my re-crawling script, I occasionally checked
Solr Admin and found that the "Optimized" status of my collection is not
optimized (slash icon).

So I ran the optimize step manually.

How can I make the index optimize automatically after crawling?

Should we restart Solr (I use Tomcat), as shown here [1]?

[1] http://wiki.apache.org/nutch/Crawl

Thanks!

-- 
wassalam,
[bayu]

Re: Auto optimized of Solr indexing results

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Thanks, Erick, for your advice and for sharing.

Regards,


On Mon, Dec 2, 2013 at 11:06 PM, Erick Erickson <er...@gmail.com> wrote:

> TieredMergePolicy is the default. Even though it's
> commented out in solrconfig, it's still being used,
> so there's nothing to do.
>
> Given the size of your index, you can actually do
> whatever you please. Optimizing it will shrink its size,
> but frankly your index is so small I doubt you'll see any
> noticeable difference. The deleted documents will be purged
> on their own eventually as you re-crawl.
>
> In all, I think you can mostly ignore the issue.
>
> Best,
> Erick



-- 
wassalam,
[bayu]

Re: Auto optimized of Solr indexing results

Posted by Erick Erickson <er...@gmail.com>.
TieredMergePolicy is the default. Even though it's
commented out in solrconfig, it's still being used,
so there's nothing to do.
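
To illustrate why the commented-out block changes nothing: XML comments are
not part of the parsed configuration, so a commented-out <mergePolicy>
element sets no explicit policy and the default applies. Below is a minimal
Python sketch, assuming a heavily simplified solrconfig.xml layout. The
sample strings and the explicit_merge_policy helper are hypothetical and
only for illustration; they are not part of Solr.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-ins for solrconfig.xml (assumption:
# the real file has many more elements around <mergePolicy>).
COMMENTED_OUT = """<config>
  <indexConfig>
    <!--
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>
    -->
  </indexConfig>
</config>"""

ENABLED = """<config>
  <indexConfig>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>
  </indexConfig>
</config>"""


def explicit_merge_policy(xml_text):
    """Return the class of an explicitly configured mergePolicy, or None.

    The default ElementTree parser drops comments, so a commented-out
    element is simply absent from the parsed tree.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//mergePolicy")
    return None if node is None else node.get("class")


if __name__ == "__main__":
    print(explicit_merge_policy(COMMENTED_OUT))
    print(explicit_merge_policy(ENABLED))
```

A result of None here only means "nothing set explicitly", which is the case
where the default merge policy is in effect.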

Given the size of your index, you can actually do
whatever you please. Optimizing it will shrink its size,
but frankly your index is so small I doubt you'll see any
noticeable difference. The deleted documents will be purged
on their own eventually as you re-crawl.

In all, I think you can mostly ignore the issue.

Best,
Erick


On Sun, Dec 1, 2013 at 8:00 PM, Bayu Widyasanyata
<bw...@gmail.com> wrote:

> Hi Erick,
>
> After waiting about a week (I crawled and indexed daily),
> here is the docs summary:
>
> Num Docs:   9738
> Max Doc:   15311
> Deleted Docs: 5573
> Version: 781
> Segment Count: 5
>
> Deleted Docs is about 57% of Num Docs.
>
> On the other hand, the TieredMergePolicy in solrconfig.xml is still disabled.
>
>     <!--
>         <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>           <int name="maxMergeAtOnce">10</int>
>           <int name="segmentsPerTier">10</int>
>         </mergePolicy>
>       -->
>
> Should we enable it and wait for the effect?
>
> Thanks!

Re: Auto optimized of Solr indexing results

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Erick,

After waiting about a week (I crawled and indexed daily),
here is the docs summary:

Num Docs:   9738
Max Doc:   15311
Deleted Docs: 5573
Version: 781
Segment Count: 5

Deleted Docs is about 57% of Num Docs.
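
The ratio above can be checked with a few lines of Python. This is only a
sketch of the arithmetic using the figures from the summary; the
deleted_stats helper name is made up for illustration. It also shows the
share relative to Max Doc, which is a different (smaller) number because
Max Doc includes the deleted documents.

```python
def deleted_stats(num_docs, max_doc):
    """Derive deleted-document figures from Num Docs and Max Doc.

    Max Doc counts live plus deleted documents, so the difference
    between the two is the number of deleted documents.
    """
    deleted = max_doc - num_docs
    return {
        "deleted": deleted,
        "pct_of_num_docs": 100.0 * deleted / num_docs,
        "pct_of_max_doc": 100.0 * deleted / max_doc,
    }


# Figures taken from the summary in this message.
stats = deleted_stats(num_docs=9738, max_doc=15311)
print("deleted docs: %d" % stats["deleted"])                       # 5573
print("relative to Num Docs: %.1f%%" % stats["pct_of_num_docs"])   # 57.2%
print("relative to Max Doc: %.1f%%" % stats["pct_of_max_doc"])     # 36.4%
```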

On the other hand, the TieredMergePolicy in solrconfig.xml is still disabled.

    <!--
        <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
          <int name="maxMergeAtOnce">10</int>
          <int name="segmentsPerTier">10</int>
        </mergePolicy>
    -->

Should we enable it and wait for the effect?

Thanks!



On Wed, Nov 20, 2013 at 9:55 PM, Bayu Widyasanyata
<bw...@gmail.com> wrote:

> Thanks Erick.
> I will check that on the next round.
>
> ---
> wassalam,
> [bayu]
>
> /sent from Android phone/


-- 
wassalam,
[bayu]

Re: Auto optimized of Solr indexing results

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Thanks Erick.
I will check that on the next round.

---
wassalam,
[bayu]

/sent from Android phone/
On Nov 20, 2013 7:45 PM, "Erick Erickson" <er...@gmail.com> wrote:

> You probably shouldn't optimize at all. The default TieredMergePolicy
> will eventually purge the deleted documents' data, which is really what
> optimize does. So despite its name, most of the time it's not really
> worth the effort.
>
> Take a look at your Solr admin page, the "overview" link for a core.
> If the number of deleted docs is a significant percentage of your
> numDocs (I typically use 20% or so, but YMMV) then optimize
> might be worthwhile. Otherwise, it's a distraction unless and until
> you have some evidence that it actually makes a difference.
>
> Best,
> Erick

Re: Auto optimized of Solr indexing results

Posted by Erick Erickson <er...@gmail.com>.
You probably shouldn't optimize at all. The default TieredMergePolicy
will eventually purge the deleted documents' data, which is really what
optimize does. So despite its name, most of the time it's not really
worth the effort.

Take a look at your Solr admin page, the "overview" link for a core.
If the number of deleted docs is a significant percentage of your
numDocs (I typically use 20% or so, but YMMV) then optimize
might be worthwhile. Otherwise, it's a distraction unless and until
you have some evidence that it actually makes a difference.
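
The rule of thumb above could be scripted at the end of a re-crawl, roughly
as in the Python sketch below. The 20% threshold is the one mentioned above;
the helper names, the core URL, and the sample document counts are
hypothetical. The sketch only builds the URL for an optimize request
(the optimize=true parameter on the update handler) and does not send it.

```python
from urllib.parse import urlencode


def should_optimize(num_docs, deleted_docs, threshold=0.20):
    """Return True when deleted docs reach `threshold` as a fraction of numDocs."""
    if num_docs <= 0:
        return False
    return deleted_docs / num_docs >= threshold


def optimize_url(core_url):
    """Build the update-handler URL that requests an optimize.

    `core_url` is an assumed base URL for one core, for example
    http://localhost:8983/solr/collection1 (adjust for your setup).
    """
    return core_url.rstrip("/") + "/update?" + urlencode({"optimize": "true"})


if __name__ == "__main__":
    # Illustrative counts only, not taken from a real index.
    for num_docs, deleted in [(10000, 500), (10000, 3000)]:
        if should_optimize(num_docs, deleted):
            print("optimize may be worthwhile:",
                  optimize_url("http://localhost:8983/solr/collection1"))
        else:
            print("skip optimize and let merging handle it")
```

Keeping the decision separate from the request makes it easy to log the
ratio for a while first and gather the evidence mentioned above before
wiring in an actual optimize call.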

Best,
Erick


On Wed, Nov 20, 2013 at 7:33 AM, Bayu Widyasanyata
<bw...@gmail.com> wrote:

> Hi,
>
> After successfully configuring my re-crawling script, I occasionally checked
> Solr Admin and found that the "Optimized" status of my collection is not
> optimized (slash icon).
>
> So I ran the optimize step manually.
>
> How can I make the index optimize automatically after crawling?
>
> Should we restart Solr (I use Tomcat), as shown here [1]?
>
> [1] http://wiki.apache.org/nutch/Crawl
>
> Thanks!
>
> --
> wassalam,
> [bayu]
>