Posted to solr-user@lucene.apache.org by stone2dbone <an...@gmail.com> on 2013/08/01 20:17:24 UTC

Skip Indexing Certain Files on Purpose

I'm using Nutch 1.6 to retrieve metadata from crawled documents (e.g. .doc,
.ppt, .pdf, etc.) for indexing by Solr 4.0. Several of the crawled files
have no value or a junk value for certain metatags. Is there a way to force
Solr to skip indexing of documents where, say, metatag.title is empty or
metatag.title is 'Slide 1'?




Re: "optimize" index : impact on performance

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/2/2013 8:13 AM, Anca Kopetz wrote:
> Then we optimized the index to 1 segment / 0 deleted docs and we got
> +40% in QPS compared to the previous test.
>
> Therefore we thought of optimizing the index every two hours, as our
> index is evolving due to frequent commits (every 30 minutes) and thus
> the performance results are degrading.
>
> 1. Is this a good practice?
> 2. Instead of executing an "optimize" many times a day, are there any
> other parameters that we can tune and test in order to gain in average QPS?
>
> We want to avoid the solution of adding more servers to our SolrCloud
> cluster.
>
> Some details of our system:
>
> SolrCloud cluster: 8 nodes on 8 dedicated servers; 2 shards / 4 replicas
> Hardware configuration: 2 processors (16 CPU cores) per server; 24GB of
> memory; 6GB allocated to the JVM
> Index: 13M documents, 15GB
> Search algorithm: grouping, faceting, filter queries
> Solr version 4.4

Please read and follow this note about thread hijacking:

http://people.apache.org/~hossman/#threadhijack

Optimizing that frequently with an index that large *might* cause more 
problems than it solves.  You'd have to actually try it to see whether 
it works for you, though.  Here's some information explaining why it may 
be a problem:

Optimizing a 15GB index is likely to take up to 15 minutes, depending on 
how fast the I/O subsystem on your servers is.  It probably won't happen 
in less than 5 minutes unless you're running on SSD, which also 
mitigates some of the impact described in the next paragraph.

Performance will be lower, potentially a LOT lower, for the few 
minutes while an optimize is running.  Solr has to read the whole 
index, process each document, and write it all back out.  That happens 
quite fast, but it's a lot of I/O.  Because the process continually 
goes back and forth between the old copy and the new copy, critical 
data will be evicted from the OS disk cache for the duration, unless 
you have enough free RAM for *twice* the index to fit in the cache, 
and based on the stats you mentioned, you don't.
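
For reference, a forced merge is issued explicitly against the update 
handler.  A minimal sketch, with placeholder host, port and collection 
name (the attributes follow the documented XML update commands):

    curl 'http://localhost:8983/solr/collection1/update' \
         -H 'Content-Type: text/xml' \
         --data-binary '<optimize maxSegments="1" waitSearcher="true"/>'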

FYI, commits every 30 minutes are NOT frequent.  Commits happening one 
or more times every *second* are frequent.

If you can share your solrconfig.xml, there might be some suggestions we 
can make so things will generally work better.  The list doesn't accept 
attachments.  It's better if you use a paste website like 
http://www.fpaste.org/, choose the proper language for highlighting, and 
set the "delete after" setting to something that will work for you. 
Making it a paste that never gets deleted will mean that your message 
will retain usefulness for others as long as archives exist, but you 
might not want it available that long.

Properly tuning your garbage collection is important.  The default 
garbage collector is, risking a pun, garbage.

http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
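
As a rough illustration of the kind of settings those pages discuss 
(these are generic HotSpot options, not values taken from the wiki, 
and the log path is a placeholder), GC options are passed on the java 
command line when starting Solr:

    java -Xms6g -Xmx6g \
         -XX:+UseG1GC -XX:MaxGCPauseMillis=250 \
         -verbose:gc -Xloggc:/var/log/solr/gc.log \
         -jar start.jar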

Thanks,
Shawn


Re: "optimize" index : impact on performance

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: "optimize" index : impact on performance
: References: <13...@n3.nabble.com>
: In-Reply-To: <13...@n3.nabble.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message; instead, start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to, so your question is "hidden" in that thread and gets less 
attention.  It also makes following discussions in the mailing list 
archives particularly difficult.


-Hoss

"optimize" index : impact on performance

Posted by Anca Kopetz <an...@kelkoo.com>.
Hi,

We are trying to improve the performance of our Solr Search application in terms of QPS (queries per second).

We tuned Solr settings (e.g. mergeFactor=3) and ran several benchmarks; performance improved, but it was still unsatisfactory for our traffic volume.
Then we optimized the index to 1 segment / 0 deleted docs and got +40% in QPS compared to the previous test.
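
For context, mergeFactor is set in the <indexConfig> section of solrconfig.xml; in Solr 4.x it is effectively shorthand for the TieredMergePolicy settings shown in the comment below. This is an illustrative sketch, not the exact configuration in use here:

    <indexConfig>
      <mergeFactor>3</mergeFactor>
      <!-- roughly equivalent explicit form:
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnce">3</int>
        <int name="segmentsPerTier">3</int>
      </mergePolicy>
      -->
    </indexConfig>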

We are therefore considering optimizing the index every two hours, since the index keeps evolving due to frequent commits (every 30 minutes) and performance degrades as a result.

1. Is this a good practice?
2. Instead of executing an "optimize" many times a day, are there any other parameters that we can tune and test in order to gain in average QPS?

We want to avoid the solution of adding more servers to our SolrCloud cluster.

Some details of our system:

SolrCloud cluster: 8 nodes on 8 dedicated servers; 2 shards / 4 replicas
Hardware configuration: 2 processors (16 CPU cores) per server; 24GB of memory; 6GB allocated to the JVM
Index: 13M documents, 15GB
Search algorithm: grouping, faceting, filter queries
Solr version 4.4

Best regards,
Anca Kopetz



Re: Skip Indexing Certain Files on Purpose

Posted by Jack Krupansky <ja...@basetechnology.com>.
Yeah, roughly.

The example I added to the book (for EAR #5) is generalized: it takes a 
field name parameter, a regular expression to match, and a case-sensitivity 
flag.  It works for multivalued fields, and it can optionally log the docs 
that are skipped.
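
A minimal sketch of such a script, limited to the "title" field and the 
values from the original question (the processAdd/solrInputDocument names 
follow the update-script.js example that ships with Solr, and "logger" is 
the variable the script update processor provides; treat this as an 
illustration, not the book's example):

    function processAdd(cmd) {
      var doc = cmd.solrInputDocument;
      var title = doc.getFieldValue("title");   // null if the field is absent
      // Skip docs whose title is missing, empty, or the junk value "Slide 1"
      if (title == null || String(title) == "" || String(title) == "Slide 1") {
        logger.info("Skipping document with junk title: " + title);
        return false;   // returning false aborts the update for this document
      }
      return true;
    }

    // The example script that ships with Solr also defines handlers for the
    // other update events; empty stubs are enough when they are not needed.
    function processDelete(cmd) { }
    function processCommit(cmd) { }
    function finish() { }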

-- Jack Krupansky

-----Original Message----- 
From: stone2dbone
Sent: Friday, August 02, 2013 9:33 AM
To: solr-user@lucene.apache.org
Subject: Re: Skip Indexing Certain Files on Purpose

Jack, thanks for the response.  So, adding something as simple as the
following to the processAdd() function should do the trick in your opinion?

    this_title = doc.getFieldValue("title");
    if (this_title == "Slide 1"){
        return false;
    }

Regards,
ADS





Re: Skip Indexing Certain Files on Purpose

Posted by stone2dbone <an...@gmail.com>.
Jack, thanks for the response.  So, adding something as simple as the
following to the processAdd() function should do the trick in your opinion?

	this_title = doc.getFieldValue("title");
	if (this_title == "Slide 1"){
		return false;
	}

Regards,
ADS




Re: Skip Indexing Certain Files on Purpose

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could have a StatelessScriptUpdateProcessor detect the file type and 
then return false, which aborts the update.
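
For completeness, wiring that up takes an update processor chain in 
solrconfig.xml along these lines (the chain name and script file name are 
placeholders; the script file goes in the core's conf directory):

    <updateRequestProcessorChain name="skip-junk-docs">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">update-script.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

The chain is then selected per request with the update.chain parameter, or 
set as a default on the /update request handler.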

I'll be sure to add such an example to the next early access release of my 
book!

-- Jack Krupansky

-----Original Message----- 
From: stone2dbone
Sent: Thursday, August 01, 2013 2:17 PM
To: solr-user@lucene.apache.org
Subject: Skip Indexing Certain Files on Purpose

I'm using Nutch 1.6 to retrieve metadata from crawled documents (e.g. .doc,
.ppt, .pdf, etc.) for indexing by Solr 4.0. Several of the crawled files
have no value or a junk value for certain metatags. Is there a way to force
Solr to skip indexing of documents where, say, metatag.title is empty or
metatag.title is 'Slide 1'?


